For this project you will take the role of a consultant hired by a real estate investment firm in Ames, Iowa, a Midwestern town in the United States. Your job is to analyze data to provide insight into how the firm should invest for the highest profits, and to quantify and communicate to company management which types of real estate properties are good investments and why. They have provided you with data on housing sales from 2006 to 2010 that contains information about the characteristics of each house (number of bedrooms, number of bathrooms, square footage, etc.) and its sale price. The codebook for this data set is available online here or in the Data folder in your repo.
It’s generally a bad idea to buy the most expensive house in the neighborhood. And remember the real estate agents’ mantra: Location, location, location! Keep in mind that the goal is to make money for your investors, and hence investing in a property that is overvalued (costing more than it is worth) is rarely a good idea. This means that it’s critical to know which properties are overvalued and which are undervalued. The company that hired you has many questions for you about the housing market. It is up to you to decide what methods you want to use (frequentist or Bayesian) to answer these questions, and implement them to help to identify undervalued and overvalued properties.
You will have three data sets: a subset for training, a subset for testing, and a third subset for validation. You will be asked to do data exploration and build your model (or models) initially using only the training data. Then, you will test your model on the testing data, and finally validate using the validation data. We are challenging you to keep your analysis experience realistic, and in a realistic scenario you would not have access to all three of these data sets at once. You will be able to see on our scoreboard how well your team is doing based on its predictive performance on the testing data. After your project is turned in you will see the final score on the validation set.
All members of the team should contribute equally and answer any questions about the analysis at the final presentation.
For your analysis, create a new notebook named “project.Rmd” and update it accordingly rather than editing this one.
To get started read in the training data.
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(tidyr)
load("ames_train.Rdata")
print(paste0("The dataset has ", nrow(ames_train), " observations and ", ncol(ames_train), " features"))
[1] "The dataset has 1500 observations and 81 features"
#Variables with NA's and their proportion of missing data
miss = apply(is.na(ames_train), 2, sum)
miss_prop = round(miss[miss>0]/nrow(ames_train), 3)
print(miss_prop)
Lot.Frontage Alley Mas.Vnr.Area Bsmt.Qual Bsmt.Cond Bsmt.Exposure
0.188 0.930 0.004 0.034 0.034 0.034
BsmtFin.Type.1 BsmtFin.Type.2 Bsmt.Full.Bath Bsmt.Half.Bath Fireplace.Qu Garage.Type
0.034 0.034 0.001 0.001 0.474 0.046
Garage.Yr.Blt Garage.Finish Garage.Qual Garage.Cond Pool.QC Fence
0.047 0.046 0.047 0.047 0.995 0.789
Misc.Feature
0.965
which(miss_prop>0.5) # four features have greater than 50% of data "missing" -- drop these variables
Alley Pool.QC Fence Misc.Feature
2 17 18 19
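As a minimal sketch of the screening step above, the following uses a small made-up data frame (`toy`, not the Ames data) to compute per-column missing proportions and to list and drop the columns that are more than 50% missing. The column names mirror the real ones, but the values are purely illustrative.

```r
# Toy data frame standing in for ames_train (values are made up)
toy <- data.frame(
  Lot.Frontage = c(60, NA, 75, 80),
  Alley        = c(NA, NA, NA, "Pave"),
  price        = c(120000, 150000, 98000, 210000)
)

# Proportion of missing values per column
miss_prop <- colMeans(is.na(toy))

# Names of columns with more than half of their values missing
mostly_missing <- names(miss_prop)[miss_prop > 0.5]

# Drop those columns
toy_reduced <- toy[, !(names(toy) %in% mostly_missing), drop = FALSE]
print(names(toy_reduced))
```

In the Ames codebook an NA in these variables generally means that the feature is absent (no alley access, no pool, and so on), which is why the cleaning code further down recodes such NAs to "None" rather than discarding the information.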
Notes about data cleaning:
We dropped “utilities” (type of utilities available) since only 2 observations in the training set did not have all the utilities (electricity, gas, water and sewage). Intuitively, most modern properties are equipped with these basic public utilities, so keeping the variable would be unnecessary.
We also dropped “condition 2” (proximity to various conditions if more than one is present) since it appeared redundant in our training set: given Condition 1, only 12 properties had a second condition other than normal.
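On the original calendar-year scale, years such as 1990 and 1900 would not look very different. We therefore replaced the year variables with the number of years between the most recent construction or remodelling and the year of sale (HouseAge in the code below, computed as Yr.Sold - pmax(Year.Built, Year.Remod.Add)); the last sale year in the dataset is 2010.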
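The age variable is built in the cleaning pipeline as Yr.Sold - pmax(Year.Built, Year.Remod.Add). A small sketch on made-up rows (the data frame `toy` is hypothetical) shows how pmax() picks the later of the build and remodel years row by row:

```r
# Made-up rows using the same column names as the Ames data
toy <- data.frame(
  Year.Built     = c(1900, 1990, 2005),
  Year.Remod.Add = c(1995, 1990, 2007),
  Yr.Sold        = c(2008, 2006, 2010)
)

# Years between the most recent construction or remodel and the sale
toy$HouseAge <- toy$Yr.Sold - pmax(toy$Year.Built, toy$Year.Remod.Add)
print(toy)
```

A house built in 1900 but remodelled in 1995 is therefore treated as 13 years old at a 2008 sale, not 108.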
Another variable dropped from our model was “roof material”, since only 1% of the properties used a material other than “standard composite shingle”. Similarly, “heating” was dropped since more than 95% of the properties have a gas forced warm air furnace (GasA) rather than another type of heating.
For “exterior quality” and “exterior condition”, we recoded these ordinal variables to 1-5 to replace the original scale of conditions (from poor to excellent). Similarly, we recoded “basement exposure” and “basement rating”, except that the new scale starts from 0 for properties without a basement.
Exter Qual (Ordinal): Evaluates the quality of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Exter Cond (Ordinal): Evaluates the present condition of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
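One way to carry out the 1-5 recoding described above is a named lookup vector in base R. This is only a sketch on made-up values; the cleaning code in this section uses plyr::mapvalues for the same purpose, and the names `quality_levels`, `exter_qual` and `bsmt_qual` are illustrative.

```r
# Lookup from the codebook's quality codes to an ordinal 1-5 score
quality_levels <- c(Po = 1, Fa = 2, TA = 3, Gd = 4, Ex = 5)

# Made-up exterior quality ratings
exter_qual <- factor(c("TA", "Gd", "Ex", "Fa", "TA"))
exter_qual_num <- unname(quality_levels[as.character(exter_qual)])
print(exter_qual_num)

# Basement-style variant: NA means "no basement" and is scored 0
bsmt_qual <- c("Gd", NA, "TA")
bsmt_qual_num <- unname(quality_levels[bsmt_qual])
bsmt_qual_num[is.na(bsmt_qual)] <- 0
print(bsmt_qual_num)
```

Converting through as.character() matters for factors: indexing with the factor itself would use its integer codes rather than its labels.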
Continuous variables such as “1st floor square feet” and “2nd floor square feet” were log-transformed for ease of interpretation.
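To illustrate why the log scale helps interpretation: when both price and a size variable are logged, the coefficient acts as an elasticity. The sketch below takes the log(TotalSq) estimate of about 0.37 reported in the BIC model summary later in this section and computes the implied price change for a 10% larger house; the 10% figure is an arbitrary illustration. It also shows log1p(), which equals log(1 + x) and stays finite at zero, the same shift the models apply to variables such as Mas.Vnr.Area.

```r
# Coefficient on log(TotalSq), rounded from the BIC model summary in this section
beta_totalsq <- 0.37

# Multiplicative change in predicted price for a 10% increase in TotalSq,
# holding the other predictors fixed
price_multiplier <- 1.10^beta_totalsq
print(round(100 * (price_multiplier - 1), 1))  # percent change in price

# log1p(x) is log(1 + x): defined at zero, unlike log(x)
areas <- c(0, 120, 450)
print(log1p(areas))
print(all.equal(log1p(areas), log(1 + areas)))
```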
For variable “functional”, we recoded different ordinal levels into binary levels — typical functionality or not, including minor and major deductions.
We summed up the number of bathrooms to one continuous variable. Note that one half-bathroom would be coded as 0.5.
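The last two notes can be sketched on made-up rows as follows. The bathroom formula matches the one used in the cleaning pipeline below; the binary `Functional.Typ` column is a hypothetical name for the typical-or-not recoding.

```r
# Made-up rows using the Ames column names
toy <- data.frame(
  Functional     = c("Typ", "Min1", "Maj2", "Typ"),
  Bsmt.Full.Bath = c(1, 0, 0, 1),
  Bsmt.Half.Bath = c(0, 1, 0, 0),
  Full.Bath      = c(2, 1, 1, 2),
  Half.Bath      = c(1, 0, 0, 1)
)

# 1 if functionality is typical, 0 for any deduction (minor or major)
toy$Functional.Typ <- as.integer(toy$Functional == "Typ")

# Total bathrooms, counting each half bath as 0.5
toy$Baths <- with(toy, Bsmt.Full.Bath + 0.5 * Bsmt.Half.Bath +
                       Full.Bath + 0.5 * Half.Bath)
print(toy[, c("Functional", "Functional.Typ", "Baths")])
```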
# Did not remove any NA entries in Lot.frontage
data=ames_train
data <- data %>%
#filter(!is.na(Lot.Frontage)) %>%
mutate(MS.SubClass= factor(MS.SubClass)) %>%
mutate(Alley = factor(Alley, levels = levels(addNA(Alley)), labels = c(levels(Alley), "None"), exclude = NULL)) %>%
mutate(HouseAge = Yr.Sold- pmax(Year.Built, Year.Remod.Add)) %>%
filter(!is.na(Mas.Vnr.Area)) %>%
mutate(Bsmt.YN = 1*(!is.na(Bsmt.Qual))) %>%
mutate(Bsmt.Qual = factor(Bsmt.Qual, levels = levels(addNA(Bsmt.Qual)), labels = c(levels(Bsmt.Qual), "None"), exclude = NULL)) %>%
mutate(Bsmt.Qual = relevel(Bsmt.Qual, ref="None")) %>%
mutate(Bsmt.Cond = factor(Bsmt.Cond, levels = levels(addNA(Bsmt.Cond)), labels = c(levels(Bsmt.Cond), "None"), exclude = NULL)) %>%
mutate(Bsmt.Cond = relevel(Bsmt.Cond, ref="None")) %>%
mutate(Bsmt.Exposure = factor(Bsmt.Exposure, levels = levels(addNA(Bsmt.Exposure)), labels = c(levels(Bsmt.Exposure), "None"), exclude = NULL)) %>%
mutate(Bsmt.Exposure = relevel(Bsmt.Exposure, ref="None")) %>%
mutate(BsmtFin.Type.1= factor(BsmtFin.Type.1, levels = levels(addNA(BsmtFin.Type.1)), labels = c(levels(BsmtFin.Type.1), "None"), exclude = NULL)) %>%
mutate(BsmtFin.Type.1 = relevel(BsmtFin.Type.1, ref="None")) %>%
mutate(BsmtFin.Type.2= factor(BsmtFin.Type.2, levels = levels(addNA(BsmtFin.Type.2)), labels = c(levels(BsmtFin.Type.2), "None"), exclude = NULL)) %>%
mutate(BsmtFin.Type.2 = relevel(BsmtFin.Type.2, ref="None")) %>%
mutate(X12.SF= X1st.Flr.SF+ X2nd.Flr.SF) %>%
filter(!is.na(Bsmt.Full.Bath)) %>%
filter(!is.na(Bsmt.Half.Bath)) %>%
mutate(Baths = Bsmt.Full.Bath + 0.5*Bsmt.Half.Bath + Full.Bath + 0.5*Half.Bath) %>%
mutate(Fireplace.YN = 1*(Fireplaces>0)) %>%
mutate(Fireplace.Qu = factor(Fireplace.Qu, levels = levels(addNA(Fireplace.Qu)), labels = c(levels(Fireplace.Qu), "None"), exclude = NULL)) %>%
mutate(Fireplace.Qu = relevel(Fireplace.Qu, ref="None")) %>%
mutate(Garage.YN = 1*(!is.na(Garage.Cond))) %>%
mutate(Garage.Type = factor(Garage.Type, levels = levels(addNA(Garage.Type)), labels = c(levels(Garage.Type), "None"), exclude = NULL)) %>%
mutate(Garage.Type = relevel(Garage.Type, ref="None")) %>%
mutate(Garage.Finish = factor(Garage.Finish, levels = levels(addNA(Garage.Finish)), labels = c(levels(Garage.Finish), "None"), exclude = NULL)) %>%
mutate(Garage.Finish = relevel(Garage.Finish, ref="None")) %>%
mutate(Garage.Qual = factor(Garage.Qual, levels = levels(addNA(Garage.Qual)), labels = c(levels(Garage.Qual), "None"), exclude = NULL)) %>%
mutate(Garage.Qual = relevel(Garage.Qual, ref="None")) %>%
mutate(Garage.Cond = factor(Garage.Cond, levels = levels(addNA(Garage.Cond)), labels = c(levels(Garage.Cond), "None"), exclude = NULL)) %>%
mutate(Garage.Cond = relevel(Garage.Cond, ref="None")) %>%
mutate(Porch.Area = Wood.Deck.SF+ Open.Porch.SF+Enclosed.Porch+X3Ssn.Porch + Screen.Porch) %>%
mutate(Pool.YN = 1*(Pool.Area>0)) %>%
mutate(Pool.QC = factor(Pool.QC, levels = levels(addNA(Pool.QC)), labels = c(levels(Pool.QC), "None"), exclude = NULL)) %>%
mutate(Pool.QC = relevel(Pool.QC, ref="None")) %>%
mutate(Fence = factor(Fence, levels = levels(addNA(Fence)), labels = c(levels(Fence), "None"), exclude = NULL)) %>%
mutate(Misc.Feature = factor(Misc.Feature, levels = levels(addNA(Misc.Feature)), labels = c(levels(Misc.Feature), "None"), exclude = NULL)) %>%
mutate(Mo.Sold = as.factor(Mo.Sold)) %>%
mutate(Yr.Sold = as.factor(Yr.Sold)) %>%
dplyr::select(-Garage.Yr.Blt) %>%
mutate(Condition.1 = as.character(Condition.1)) %>%
mutate(Kitchen.Qual=plyr::mapvalues(Kitchen.Qual, from = c("Po", "Fa", "TA","Gd", "Ex" ), to = c("1", "2", "3", "4", "5"))) %>%
mutate(Kitchen.Qual = as.numeric(as.character(Kitchen.Qual))) %>%
mutate(Heating.QC=plyr::mapvalues(Heating.QC, from = c("Po", "Fa", "TA","Gd", "Ex" ), to = c("1", "2", "3", "4", "5"))) %>%
mutate(Heating.QC = as.numeric(as.character(Heating.QC))) %>%
mutate(Bsmt.Qual = droplevels(Bsmt.Qual)) %>%
mutate(Functional = droplevels(Functional)) %>%
mutate(Roof.Matl = droplevels(Roof.Matl))
# Simplify Condition 1 (Park, Rail, Normal)
ind_rail<-which(data$Condition.1=="RRNn" | data$Condition.1=="RRAn" | data$Condition.1=="RRNe" | data$Condition.1=="RRAe")
ind_park<-which(data$Condition.1=="PosN" | data$Condition.1=="PosA")
data$Condition.1[ind_rail]<-"Rail"
data$Condition.1[ind_park]<-"Park"
data = data %>%
mutate(Condition.1 = factor(Condition.1)) %>%
mutate(Condition.1 = relevel(Condition.1, ref="Norm"))
# Eliminate the one entry in 'Exposure' that had been left completely empty
data_train<-data
data_train$Bsmt.Exposure[which(data_train$Bsmt.Exposure=="")]<-"None"
data_train$Bsmt.Exposure<-droplevels(data_train$Bsmt.Exposure)
# Shift by 1 so that log() is defined for houses with no pool or no basement (area 0)
data_train$Pool.Area<-data_train$Pool.Area+1
data_train$Total.Bsmt.SF<-data_train$Total.Bsmt.SF+1
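The pipeline above repeats the same "turn NA into a None level and make it the reference" pattern for many factors. As a possible simplification, the pattern could be wrapped in a helper. The function `na_to_none` below is hypothetical and not part of the original analysis; unlike the original code it sorts the remaining levels alphabetically.

```r
# Hypothetical helper: recode NA as an explicit "None" level and
# place that level first so it becomes the reference category in lm()
na_to_none <- function(x, none_label = "None") {
  x <- as.character(x)
  x[is.na(x)] <- none_label
  factor(x, levels = union(none_label, sort(unique(x))))
}

# Made-up garage types with one missing entry
garage_type <- factor(c("Attchd", NA, "Detchd", "Attchd"))
recoded <- na_to_none(garage_type)

print(levels(recoded))
print(table(recoded))
```

With dplyr loaded, such a helper could be applied to several columns at once, for example with mutate(across(c(Garage.Type, Garage.Finish), na_to_none)).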
The Neighborhood variable, typically of little interest other than to model the location effect, may be of more relevance when used with the map.
We are restricting attention to just the “normal sales” condition.
In the first model you are allowed only limited manipulations of the original data set to predict the sale price. You are allowed to take power transformations of the original variables [square roots, logs, inverses, squares, etc.] but you are NOT allowed to create interaction variables. This means that a variable may only be used once in an equation [if you use $x^2$, don’t use $x$]. Additionally, you may eliminate any data points you deem unfit. This model should have a minimum R-squared of 73% (in the original units) and contain at least 6 variables but fewer than 20.
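Because the models in this section are fitted to log(price), the R-squared printed by summary() is on the log scale. One plausible way to check the "original units" requirement is to back-transform the fitted values and compute 1 - SSE/SST by hand. The sketch below uses the built-in mtcars data as a stand-in, with mpg playing the role of price; it is an assumption about how the requirement might be checked, not the course's official definition.

```r
# Stand-in model: response is logged, as with log(price) in this section
fit <- lm(log(mpg) ~ log(wt) + hp, data = mtcars)

# Back-transform the fitted values to the original units
pred <- exp(fitted(fit))
y    <- mtcars$mpg

# R-squared in the original units
r2_original <- 1 - sum((y - pred)^2) / sum((y - mean(y))^2)

# Compare with the log-scale R-squared reported by summary()
r2_log <- summary(fit)$r.squared
print(c(original = r2_original, log_scale = r2_log))
```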
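### performance evaluation function
# Y: observed values; Yhat: matrix with columns (fit, lower, upper), e.g. from predict(..., interval = "prediction")
performance<- function(Y, Yhat){
bias<- mean(Y-Yhat[,1])
max.dev<-max(abs(Y-Yhat[,1]))
mean.dev<-mean(abs(Y-Yhat[,1]))
RMSE<-sqrt(mean((Y-Yhat[,1])^2))
# share of observations falling inside their interval
coverage<-mean((Y>Yhat[,2]) & (Y<Yhat[,3]))
out<-data.frame(bias=bias, max.dev=max.dev, mean.dev=mean.dev, RMSE=RMSE, Coverage=coverage)
return(out)
}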
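A hedged usage sketch for a function of this shape: it expects the observed values and a three-column matrix of point predictions with lower and upper bounds, which is what predict() returns for an lm when interval = "prediction". The example uses mtcars as a stand-in for the Ames training and testing sets and repeats a compact copy of the function so that it runs on its own.

```r
# Compact copy of the section's performance() function so this block runs on its own
performance <- function(Y, Yhat) {
  err <- Y - Yhat[, 1]
  data.frame(
    bias     = mean(err),
    max.dev  = max(abs(err)),
    mean.dev = mean(abs(err)),
    RMSE     = sqrt(mean(err^2)),
    Coverage = mean(Y > Yhat[, 2] & Y < Yhat[, 3])
  )
}

set.seed(1)                                 # reproducible split
idx   <- sample(nrow(mtcars), 22)           # about two thirds for training
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

fit <- lm(log(mpg) ~ log(wt) + hp, data = train)

# 95% prediction intervals on the log scale, mapped back with exp()
Yhat <- exp(predict(fit, newdata = test, interval = "prediction", level = 0.95))

perf <- performance(test$mpg, Yhat)
print(perf)
```

For a well-calibrated model the Coverage column should be close to the nominal level of the intervals, here 0.95, although with a small test set it can deviate noticeably.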
library(MASS)
Attaching package: ‘MASS’
The following object is masked from ‘package:dplyr’:
select
# Base model with transformed predictors
model=lm(price ~ MS.SubClass + MS.Zoning + log(Lot.Frontage) + log(Lot.Area) + Street + Alley + Lot.Shape + Land.Contour + Lot.Config + Land.Slope + Neighborhood + Condition.1 + Bldg.Type + House.Style + Overall.Qual + Overall.Cond + HouseAge + Roof.Style + Roof.Matl + Exterior.1st + Mas.Vnr.Type + log(1+Mas.Vnr.Area) + Exter.Cond + Exter.Qual + Foundation + Bsmt.Qual + Bsmt.Cond + Bsmt.Exposure + Total.Bsmt.SF + Heating + Heating.QC + Central.Air + Electrical + log(X12.SF) + log(1+Low.Qual.Fin.SF) + Baths + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual + Functional + Fireplaces + Fireplace.Qu + Garage.Type + Garage.Finish + Garage.Cars + Garage.Cond + Garage.Qual + Paved.Drive + log(1+Pool.Area) + Pool.QC + Fence + Misc.Val + Mo.Sold +Yr.Sold + Sale.Type + TotalSq, data=data_train)
# Boxcox (indicates that log is decent)
l<-boxcox(model)
expo<-round(l$x[which.max(l$y)],2)
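For readers unfamiliar with the Box-Cox step, here is a minimal sketch on the built-in mtcars data (a stand-in, not the Ames model). MASS::boxcox() profiles the log-likelihood over a grid of exponents lambda; a maximizer near 0 points to a log transformation of the response, near 0.5 to a square root, and near 1 to no transformation.

```r
library(MASS)

# Stand-in model with an untransformed response
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Profile log-likelihood over the default grid of lambda values, without plotting
bc <- boxcox(fit, plotit = FALSE)

# Exponent with the highest profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)
```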
## Current model
## log(Lot.Frontage) currently removed to have more data points (was not included when left in the model with BIC)
model.0=lm(log(price) ~ MS.SubClass + MS.Zoning + log(Lot.Area) + Street + Alley + Lot.Shape + Land.Contour + Lot.Config + Land.Slope + Neighborhood + Condition.1 + Bldg.Type + House.Style + Overall.Qual + Overall.Cond + HouseAge + Roof.Style + Roof.Matl + Exterior.1st + Mas.Vnr.Type + log(1+Mas.Vnr.Area) + Exter.Cond + Exter.Qual + Foundation + Bsmt.Qual + Bsmt.Cond + Bsmt.Exposure + log(Total.Bsmt.SF) + Bsmt.YN+ Heating + Heating.QC + Central.Air + log(1+Low.Qual.Fin.SF) + Baths + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual + Functional + Fireplaces + Fireplace.Qu + Garage.Type + Garage.Finish + Garage.Cars +Garage.Qual+Garage.Cond + Paved.Drive + log(Pool.Area) + Pool.QC + Fence + Misc.Val + Mo.Sold +Yr.Sold + Sale.Type + log(TotalSq) + Pool.YN , data=data_train)
# There are some perfect collinearities in this model -> eliminate them via AIC/BIC
summary(model.0)
Call:
lm(formula = log(price) ~ MS.SubClass + MS.Zoning + log(Lot.Area) +
Street + Alley + Lot.Shape + Land.Contour + Lot.Config +
Land.Slope + Neighborhood + Condition.1 + Bldg.Type + House.Style +
Overall.Qual + Overall.Cond + HouseAge + Roof.Style + Roof.Matl +
Exterior.1st + Mas.Vnr.Type + log(1 + Mas.Vnr.Area) + Exter.Cond +
Exter.Qual + Foundation + Bsmt.Qual + Bsmt.Cond + Bsmt.Exposure +
log(Total.Bsmt.SF) + Bsmt.YN + Heating + Heating.QC + Central.Air +
log(1 + Low.Qual.Fin.SF) + Baths + Bedroom.AbvGr + Kitchen.AbvGr +
Kitchen.Qual + Functional + Fireplaces + Fireplace.Qu + Garage.Type +
Garage.Finish + Garage.Cars + Garage.Qual + Garage.Cond +
Paved.Drive + log(Pool.Area) + Pool.QC + Fence + Misc.Val +
Mo.Sold + Yr.Sold + Sale.Type + log(TotalSq) + Pool.YN, data = data_train)
Residuals:
Min 1Q Median 3Q Max
-0.43874 -0.04567 0.00024 0.04925 0.25557
Coefficients: (8 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.135e+00 2.287e-01 31.199 < 2e-16 ***
MS.SubClass30 -7.720e-02 1.621e-02 -4.762 2.13e-06 ***
MS.SubClass40 -2.625e-02 5.441e-02 -0.483 0.629528
MS.SubClass45 -1.585e-01 1.216e-01 -1.304 0.192564
MS.SubClass50 -2.953e-02 3.292e-02 -0.897 0.369868
MS.SubClass60 -3.433e-02 2.747e-02 -1.250 0.211688
MS.SubClass70 -7.417e-02 2.987e-02 -2.483 0.013157 *
MS.SubClass75 -1.948e-02 8.094e-02 -0.241 0.809826
MS.SubClass80 -6.554e-02 4.932e-02 -1.329 0.184105
MS.SubClass85 -3.716e-02 3.572e-02 -1.040 0.298403
MS.SubClass90 -2.750e-02 2.873e-02 -0.957 0.338788
MS.SubClass120 3.104e-02 4.763e-02 0.652 0.514768
MS.SubClass150 -1.220e-01 1.210e-01 -1.008 0.313770
MS.SubClass160 -5.059e-02 5.922e-02 -0.854 0.393087
MS.SubClass180 -2.540e-02 8.118e-02 -0.313 0.754415
MS.SubClass190 -7.996e-02 9.993e-02 -0.800 0.423747
MS.ZoningC (all) -8.623e-02 1.100e-01 -0.784 0.433200
MS.ZoningFV 1.407e-01 1.042e-01 1.351 0.176927
MS.ZoningI (all) -2.767e-02 1.291e-01 -0.214 0.830289
MS.ZoningRH 1.659e-01 1.037e-01 1.599 0.110075
MS.ZoningRL 1.231e-01 9.957e-02 1.236 0.216735
MS.ZoningRM 7.809e-02 1.008e-01 0.775 0.438763
log(Lot.Area) 9.002e-02 1.010e-02 8.912 < 2e-16 ***
StreetPave -1.815e-02 4.429e-02 -0.410 0.682128
AlleyPave -1.134e-02 2.326e-02 -0.488 0.625946
AlleyNone 1.801e-02 1.395e-02 1.292 0.196747
Lot.ShapeIR2 -6.732e-04 1.582e-02 -0.043 0.966069
Lot.ShapeIR3 6.258e-03 3.092e-02 0.202 0.839656
Lot.ShapeReg -1.546e-03 5.964e-03 -0.259 0.795449
Land.ContourHLS 2.953e-02 1.852e-02 1.594 0.111086
Land.ContourLow -5.651e-03 2.377e-02 -0.238 0.812139
Land.ContourLvl 2.114e-02 1.375e-02 1.537 0.124571
Lot.ConfigCulDSac 1.419e-02 1.163e-02 1.220 0.222861
Lot.ConfigFR2 -3.046e-02 1.445e-02 -2.108 0.035210 *
Lot.ConfigFR3 -2.514e-02 4.191e-02 -0.600 0.548733
Lot.ConfigInside 1.273e-02 6.572e-03 1.937 0.053009 .
Land.SlopeMod 1.605e-02 1.453e-02 1.104 0.269592
Land.SlopeSev -7.217e-02 5.346e-02 -1.350 0.177255
NeighborhoodBlueste 1.012e-01 4.633e-02 2.184 0.029154 *
NeighborhoodBrDale 1.193e-02 4.322e-02 0.276 0.782511
NeighborhoodBrkSide 2.929e-02 3.587e-02 0.817 0.414264
NeighborhoodClearCr 1.518e-02 3.742e-02 0.406 0.685092
NeighborhoodCollgCr -4.175e-03 3.026e-02 -0.138 0.890275
NeighborhoodCrawfor 6.731e-02 3.434e-02 1.960 0.050179 .
NeighborhoodEdwards -8.750e-02 3.226e-02 -2.712 0.006778 **
NeighborhoodGilbert -4.620e-02 3.155e-02 -1.464 0.143425
NeighborhoodGreens 5.330e-02 4.491e-02 1.187 0.235555
NeighborhoodGrnHill 4.587e-01 7.033e-02 6.522 9.94e-11 ***
NeighborhoodIDOTRR -4.480e-02 3.976e-02 -1.127 0.260077
NeighborhoodLandmrk -8.096e-02 9.648e-02 -0.839 0.401562
NeighborhoodMeadowV -9.830e-02 4.989e-02 -1.970 0.049004 *
NeighborhoodMitchel -1.939e-02 3.247e-02 -0.597 0.550505
NeighborhoodNAmes -5.754e-02 3.165e-02 -1.818 0.069271 .
NeighborhoodNoRidge 7.880e-02 3.380e-02 2.332 0.019881 *
NeighborhoodNPkVill 4.515e-02 4.172e-02 1.082 0.279425
NeighborhoodNridgHt 3.993e-02 3.232e-02 1.235 0.216929
NeighborhoodNWAmes -3.705e-02 3.264e-02 -1.135 0.256547
NeighborhoodOldTown -5.117e-02 3.647e-02 -1.403 0.160835
NeighborhoodSawyer -1.697e-02 3.244e-02 -0.523 0.600886
NeighborhoodSawyerW -3.472e-02 3.147e-02 -1.103 0.270103
NeighborhoodSomerst 7.220e-02 4.009e-02 1.801 0.071939 .
NeighborhoodStoneBr 7.764e-02 3.451e-02 2.250 0.024620 *
NeighborhoodSWISU -4.571e-02 3.748e-02 -1.220 0.222869
NeighborhoodTimber -2.846e-02 3.383e-02 -0.841 0.400336
NeighborhoodVeenker 2.246e-02 3.973e-02 0.565 0.571929
Condition.1Artery -7.044e-02 1.535e-02 -4.589 4.90e-06 ***
Condition.1Feedr -7.402e-02 1.141e-02 -6.486 1.25e-10 ***
Condition.1Park -2.670e-04 1.739e-02 -0.015 0.987752
Condition.1Rail -5.997e-02 1.480e-02 -4.052 5.38e-05 ***
Bldg.Type2fmCon 3.790e-02 9.699e-02 0.391 0.696031
Bldg.TypeDuplex NA NA NA NA
Bldg.TypeTwnhs -4.101e-02 4.983e-02 -0.823 0.410662
Bldg.TypeTwnhsE -3.401e-02 4.657e-02 -0.730 0.465340
House.Style1.5Unf 1.452e-01 1.200e-01 1.210 0.226482
House.Style1Story 4.451e-02 3.137e-02 1.419 0.156180
House.Style2.5Fin -8.497e-02 9.804e-02 -0.867 0.386254
House.Style2.5Unf 4.834e-03 8.411e-02 0.057 0.954181
House.Style2Story 4.421e-02 2.947e-02 1.500 0.133822
House.StyleSFoyer 9.402e-02 4.126e-02 2.279 0.022844 *
House.StyleSLvl 1.099e-01 5.280e-02 2.082 0.037563 *
Overall.Qual 4.386e-02 3.647e-03 12.025 < 2e-16 ***
Overall.Cond 3.438e-02 3.001e-03 11.455 < 2e-16 ***
HouseAge -5.985e-04 1.928e-04 -3.104 0.001954 **
Roof.StyleGable -6.870e-02 5.100e-02 -1.347 0.178219
Roof.StyleGambrel -1.341e-01 5.823e-02 -2.303 0.021454 *
Roof.StyleHip -7.569e-02 5.137e-02 -1.473 0.140909
Roof.StyleMansard -9.772e-02 6.256e-02 -1.562 0.118568
Roof.StyleShed -1.860e-02 9.228e-02 -0.202 0.840325
Roof.MatlMembran 8.715e-02 1.139e-01 0.765 0.444332
Roof.MatlRoll 7.002e-02 9.408e-02 0.744 0.456888
Roof.MatlTar&Grv 3.100e-02 3.754e-02 0.826 0.408998
Roof.MatlWdShake 1.155e-02 4.850e-02 0.238 0.811768
Roof.MatlWdShngl 7.904e-02 5.041e-02 1.568 0.117148
Exterior.1stAsphShn -3.774e-02 9.304e-02 -0.406 0.685059
Exterior.1stBrkComm 1.197e-01 7.003e-02 1.710 0.087573 .
Exterior.1stBrkFace 6.513e-02 2.716e-02 2.398 0.016616 *
Exterior.1stCBlock NA NA NA NA
Exterior.1stCemntBd 4.768e-02 2.816e-02 1.693 0.090731 .
Exterior.1stHdBoard 1.592e-02 2.409e-02 0.661 0.508939
Exterior.1stImStucc -1.087e-02 9.163e-02 -0.119 0.905605
Exterior.1stMetalSd 2.877e-02 2.355e-02 1.222 0.222118
Exterior.1stPlywood 9.545e-03 2.532e-02 0.377 0.706277
Exterior.1stPreCast 3.289e-01 1.013e-01 3.245 0.001204 **
Exterior.1stStucco 1.632e-02 3.006e-02 0.543 0.587327
Exterior.1stVinylSd 3.341e-02 2.395e-02 1.395 0.163211
Exterior.1stWd Sdng 1.428e-02 2.354e-02 0.607 0.544233
Exterior.1stWdShing 2.649e-02 2.844e-02 0.932 0.351759
Mas.Vnr.TypeBrkFace 5.390e-02 2.675e-02 2.015 0.044108 *
Mas.Vnr.TypeNone 1.103e-01 3.638e-02 3.031 0.002488 **
Mas.Vnr.TypeStone 5.822e-02 2.849e-02 2.044 0.041189 *
log(1 + Mas.Vnr.Area) 1.221e-02 5.113e-03 2.387 0.017131 *
Exter.CondFa -2.636e-02 4.346e-02 -0.606 0.544314
Exter.CondGd 2.375e-02 3.777e-02 0.629 0.529609
Exter.CondPo -1.132e-01 1.012e-01 -1.119 0.263270
Exter.CondTA 3.107e-02 3.777e-02 0.823 0.410846
Exter.QualFa -5.306e-02 3.688e-02 -1.439 0.150478
Exter.QualGd -5.349e-02 1.970e-02 -2.715 0.006716 **
Exter.QualTA -5.260e-02 2.179e-02 -2.414 0.015904 *
FoundationCBlock 2.047e-02 1.118e-02 1.831 0.067407 .
FoundationPConc 4.603e-02 1.188e-02 3.873 0.000113 ***
FoundationSlab -6.749e-03 3.072e-02 -0.220 0.826142
FoundationStone 4.719e-02 4.164e-02 1.133 0.257252
FoundationWood 5.419e-02 5.410e-02 1.002 0.316657
Bsmt.QualEx -5.934e-01 1.230e-01 -4.826 1.56e-06 ***
Bsmt.QualFa -6.228e-01 1.225e-01 -5.084 4.25e-07 ***
Bsmt.QualGd -6.587e-01 1.224e-01 -5.383 8.69e-08 ***
Bsmt.QualPo -1.649e-01 1.733e-01 -0.952 0.341390
Bsmt.QualTA -6.586e-01 1.221e-01 -5.395 8.14e-08 ***
Bsmt.CondEx -5.374e-04 6.234e-02 -0.009 0.993124
Bsmt.CondFa -2.399e-02 1.472e-02 -1.630 0.103428
Bsmt.CondGd 1.951e-03 1.303e-02 0.150 0.881017
Bsmt.CondPo 8.582e-02 7.715e-02 1.112 0.266197
Bsmt.CondTA NA NA NA NA
Bsmt.ExposureAv 1.326e-01 8.588e-02 1.544 0.122909
Bsmt.ExposureGd 1.633e-01 8.631e-02 1.892 0.058769 .
Bsmt.ExposureMn 9.901e-02 8.612e-02 1.150 0.250533
Bsmt.ExposureNo 1.089e-01 8.585e-02 1.269 0.204848
log(Total.Bsmt.SF) 9.271e-02 1.217e-02 7.616 5.06e-14 ***
Bsmt.YN NA NA NA NA
HeatingGasA 7.688e-02 9.347e-02 0.823 0.410924
HeatingGasW 1.523e-01 9.678e-02 1.574 0.115838
HeatingGrav 1.256e-02 1.146e-01 0.110 0.912736
HeatingOthW 3.011e-02 1.146e-01 0.263 0.792809
HeatingWall 1.054e-01 1.038e-01 1.015 0.310188
Heating.QC 8.207e-03 3.392e-03 2.419 0.015689 *
Central.AirY 5.164e-02 1.337e-02 3.862 0.000118 ***
log(1 + Low.Qual.Fin.SF) 1.195e-02 4.013e-03 2.977 0.002968 **
Baths 4.982e-02 4.577e-03 10.884 < 2e-16 ***
Bedroom.AbvGr -1.038e-02 4.588e-03 -2.263 0.023781 *
Kitchen.AbvGr -1.030e-01 2.454e-02 -4.199 2.87e-05 ***
Kitchen.Qual 3.426e-02 5.840e-03 5.866 5.66e-09 ***
FunctionalMaj2 -2.494e-01 5.193e-02 -4.804 1.74e-06 ***
FunctionalMin1 -1.625e-02 3.566e-02 -0.456 0.648641
FunctionalMin2 -1.274e-02 3.606e-02 -0.353 0.723946
FunctionalMod -3.512e-02 3.942e-02 -0.891 0.373152
FunctionalTyp 4.746e-02 3.290e-02 1.443 0.149399
Fireplaces 2.516e-02 8.781e-03 2.865 0.004240 **
Fireplace.QuEx 1.185e-02 2.345e-02 0.505 0.613479
Fireplace.QuFa -3.809e-03 1.740e-02 -0.219 0.826753
Fireplace.QuGd 8.029e-03 1.221e-02 0.658 0.510907
Fireplace.QuPo 1.114e-02 2.043e-02 0.545 0.585585
Fireplace.QuTA 1.034e-03 1.211e-02 0.085 0.931945
Garage.Type2Types -2.409e-02 3.354e-02 -0.718 0.472872
Garage.TypeAttchd 7.027e-03 1.677e-02 0.419 0.675292
Garage.TypeBasment -1.302e-02 3.140e-02 -0.415 0.678518
Garage.TypeBuiltIn 1.970e-02 1.991e-02 0.989 0.322625
Garage.TypeCarPort -7.919e-02 5.498e-02 -1.440 0.150039
Garage.TypeDetchd 1.634e-02 1.637e-02 0.998 0.318665
Garage.Finish -7.485e-02 1.519e-01 -0.493 0.622210
Garage.FinishFin 1.826e-02 8.378e-03 2.180 0.029443 *
Garage.FinishRFn -5.550e-03 7.428e-03 -0.747 0.455052
Garage.FinishUnf NA NA NA NA
Garage.Cars 3.427e-02 5.610e-03 6.109 1.33e-09 ***
Garage.QualFa -1.092e-02 1.439e-02 -0.759 0.448201
Garage.QualGd 4.510e-02 2.910e-02 1.550 0.121430
Garage.QualPo -3.117e-01 7.767e-02 -4.013 6.35e-05 ***
Garage.QualTA NA NA NA NA
Garage.CondFa -6.362e-02 1.804e-02 -3.527 0.000435 ***
Garage.CondGd 6.926e-04 4.160e-02 0.017 0.986719
Garage.CondPo 7.258e-02 4.399e-02 1.650 0.099151 .
Garage.CondTA NA NA NA NA
Paved.DriveP -1.649e-02 1.865e-02 -0.884 0.376870
Paved.DriveY 2.848e-02 1.229e-02 2.318 0.020627 *
log(Pool.Area) -2.200e-01 2.224e-01 -0.989 0.322688
Pool.QCEx 1.317e+00 1.152e+00 1.143 0.253103
Pool.QCFa 1.404e+00 1.416e+00 0.991 0.321709
Pool.QCGd 1.446e+00 1.447e+00 0.999 0.317933
Pool.QCTA 1.684e+00 1.362e+00 1.236 0.216777
FenceGdWo 1.227e-03 1.707e-02 0.072 0.942711
FenceMnPrv -1.599e-02 1.353e-02 -1.182 0.237505
FenceMnWw -4.282e-02 3.570e-02 -1.199 0.230655
FenceNone -1.278e-02 1.234e-02 -1.036 0.300596
Misc.Val 8.128e-07 4.966e-06 0.164 0.870024
Mo.Sold2 -9.601e-03 1.705e-02 -0.563 0.573370
Mo.Sold3 -1.316e-02 1.525e-02 -0.863 0.388463
Mo.Sold4 1.720e-02 1.479e-02 1.163 0.245173
Mo.Sold5 1.143e-02 1.415e-02 0.808 0.419406
Mo.Sold6 1.087e-02 1.386e-02 0.785 0.432847
Mo.Sold7 1.317e-02 1.400e-02 0.941 0.347126
Mo.Sold8 1.980e-03 1.537e-02 0.129 0.897511
[ reached getOption("max.print") -- omitted 18 rows ]
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.08396 on 1283 degrees of freedom
Multiple R-squared: 0.9562, Adjusted R-squared: 0.9491
F-statistic: 134 on 209 and 1283 DF, p-value: < 2.2e-16
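The "8 not defined because of singularities" note and NA rows such as Bsmt.YN in the summary above arise when a predictor is an exact linear combination of others. A minimal sketch on made-up data (the data frame `toy` and its columns are hypothetical) reproduces the effect: a has-basement indicator is fully determined by the "None" level of a basement-quality factor, so lm() cannot estimate it.

```r
# Made-up data: has_bsmt is 0 exactly when bsmt_qual is "None"
toy <- data.frame(
  y         = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  has_bsmt  = c(0, 1, 1, 0, 1, 1),
  bsmt_qual = factor(c("None", "Gd", "TA", "None", "Gd", "TA"))
)

fit <- lm(y ~ bsmt_qual + has_bsmt, data = toy)

# Coefficients that lm() could not estimate are returned as NA
aliased <- names(coef(fit))[is.na(coef(fit))]
print(aliased)

# alias() reports which other terms each aliased coefficient depends on
print(alias(fit))
```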
plot(model.0)
not plotting observations with leverage one:
47, 168, 242, 284, 411, 638, 655, 739, 804, 990, 1011, 1194, 1198, 1324, 1344, 1376
NaNs produced
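A leverage of one means the observation is fitted exactly, which typically happens when it is the only member of some factor level. A short sketch with made-up data (the data frame `toy` is hypothetical) shows how hatvalues() can be used to locate such points:

```r
# Made-up data: level "c" of the factor g has a single observation
toy <- data.frame(
  y = c(1, 2, 3, 4, 10),
  g = factor(c("a", "a", "b", "b", "c"))
)

fit <- lm(y ~ g, data = toy)

# Diagonal of the hat matrix; a value of 1 means the point determines its own fit
lev <- hatvalues(fit)
print(round(lev, 3))

# Row positions of the leverage-one observations
leverage_one <- unname(which(abs(lev - 1) < 1e-8))
print(leverage_one)
```

Checking which factor levels the flagged rows belong to is one way to decide whether sparse levels should be merged before refitting.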
#AIC
#model.AIC=step(model.0, k=2)
#summary(model.AIC)
#plot(model.AIC)
#BIC
#model.BIC=step(model.0, k=log(nrow(data_train)))
model.BIC=lm(formula = log(price) ~ log(Lot.Area) + Neighborhood + Condition.1 +
Overall.Qual + Overall.Cond + HouseAge + Foundation + Bsmt.Qual +
Bsmt.Exposure + log(Total.Bsmt.SF) + Heating.QC + Central.Air +
Baths + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual + Functional +
Fireplaces + Garage.Cars + Paved.Drive + log(Pool.Area) +
log(TotalSq), data = data_train)
summary(model.BIC)
Call:
lm(formula = log(price) ~ log(Lot.Area) + Neighborhood + Condition.1 +
Overall.Qual + Overall.Cond + HouseAge + Foundation + Bsmt.Qual +
Bsmt.Exposure + log(Total.Bsmt.SF) + Heating.QC + Central.Air +
Baths + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual + Functional +
Fireplaces + Garage.Cars + Paved.Drive + log(Pool.Area) +
log(TotalSq), data = data_train)
Residuals:
Min 1Q Median 3Q Max
-0.66831 -0.05150 0.00025 0.05629 0.32251
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.3926554 0.1169000 63.239 < 2e-16 ***
log(Lot.Area) 0.1026436 0.0075547 13.587 < 2e-16 ***
NeighborhoodBlueste 0.0146631 0.0419500 0.350 0.726738
NeighborhoodBrDale -0.0634618 0.0347412 -1.827 0.067954 .
NeighborhoodBrkSide -0.0219612 0.0306986 -0.715 0.474491
NeighborhoodClearCr 0.0028061 0.0338084 0.083 0.933862
NeighborhoodCollgCr -0.0131967 0.0275606 -0.479 0.632136
NeighborhoodCrawfor 0.0479177 0.0311068 1.540 0.123679
NeighborhoodEdwards -0.1000400 0.0294309 -3.399 0.000695 ***
NeighborhoodGilbert -0.0379098 0.0287188 -1.320 0.187036
NeighborhoodGreens 0.0359676 0.0442891 0.812 0.416864
NeighborhoodGrnHill 0.4397167 0.0709210 6.200 7.38e-10 ***
NeighborhoodIDOTRR -0.1451682 0.0324860 -4.469 8.49e-06 ***
NeighborhoodLandmrk -0.0681952 0.0949941 -0.718 0.472943
NeighborhoodMeadowV -0.1266187 0.0383000 -3.306 0.000970 ***
NeighborhoodMitchel -0.0368471 0.0295088 -1.249 0.211987
NeighborhoodNAmes -0.0565614 0.0285542 -1.981 0.047801 *
NeighborhoodNoRidge 0.0623060 0.0303150 2.055 0.040034 *
NeighborhoodNPkVill -0.0208120 0.0384833 -0.541 0.588726
NeighborhoodNridgHt 0.0509080 0.0290237 1.754 0.079643 .
NeighborhoodNWAmes -0.0544560 0.0297828 -1.828 0.067693 .
NeighborhoodOldTown -0.1188641 0.0294370 -4.038 5.68e-05 ***
NeighborhoodSawyer -0.0229563 0.0298927 -0.768 0.442639
NeighborhoodSawyerW -0.0515292 0.0289758 -1.778 0.075559 .
NeighborhoodSomerst 0.0566938 0.0276847 2.048 0.040759 *
NeighborhoodStoneBr 0.0745779 0.0321385 2.321 0.020453 *
NeighborhoodSWISU -0.0604508 0.0340779 -1.774 0.076293 .
NeighborhoodTimber -0.0324647 0.0317804 -1.022 0.307176
NeighborhoodVeenker -0.0167331 0.0376356 -0.445 0.656671
Condition.1Artery -0.0760394 0.0151294 -5.026 5.64e-07 ***
Condition.1Feedr -0.0767356 0.0113456 -6.763 1.96e-11 ***
Condition.1Park 0.0069725 0.0182004 0.383 0.701706
Condition.1Rail -0.0496206 0.0149453 -3.320 0.000922 ***
Overall.Qual 0.0494652 0.0034280 14.430 < 2e-16 ***
Overall.Cond 0.0351455 0.0027881 12.605 < 2e-16 ***
HouseAge -0.0008503 0.0001883 -4.516 6.84e-06 ***
FoundationCBlock 0.0583435 0.0105020 5.555 3.30e-08 ***
FoundationPConc 0.0718433 0.0115567 6.217 6.66e-10 ***
FoundationSlab 0.0630207 0.0291017 2.166 0.030512 *
FoundationStone 0.0038723 0.0390916 0.099 0.921107
FoundationWood 0.0465776 0.0554929 0.839 0.401418
Bsmt.QualEx -0.7570306 0.1136937 -6.659 3.94e-11 ***
Bsmt.QualFa -0.8357544 0.1124732 -7.431 1.85e-13 ***
Bsmt.QualGd -0.8407558 0.1123115 -7.486 1.24e-13 ***
Bsmt.QualPo -0.8411315 0.1443930 -5.825 7.04e-09 ***
Bsmt.QualTA -0.8503101 0.1122157 -7.577 6.31e-14 ***
Bsmt.ExposureAv 0.1590409 0.0918543 1.731 0.083588 .
Bsmt.ExposureGd 0.1929930 0.0921545 2.094 0.036415 *
Bsmt.ExposureMn 0.1151061 0.0920620 1.250 0.211392
Bsmt.ExposureNo 0.1296663 0.0917945 1.413 0.158000
log(Total.Bsmt.SF) 0.1191674 0.0091824 12.978 < 2e-16 ***
Heating.QC 0.0091548 0.0033638 2.722 0.006577 **
Central.AirY 0.0587416 0.0121548 4.833 1.49e-06 ***
Baths 0.0470497 0.0045251 10.397 < 2e-16 ***
Bedroom.AbvGr -0.0140992 0.0042615 -3.308 0.000961 ***
Kitchen.AbvGr -0.0977644 0.0140392 -6.964 5.05e-12 ***
Kitchen.Qual 0.0346078 0.0056027 6.177 8.51e-10 ***
FunctionalMaj2 -0.1661268 0.0518564 -3.204 0.001387 **
FunctionalMin1 0.0143539 0.0345741 0.415 0.678085
FunctionalMin2 0.0352368 0.0342227 1.030 0.303356
FunctionalMod -0.0059146 0.0372751 -0.159 0.873948
FunctionalTyp 0.0882264 0.0312921 2.819 0.004877 **
Fireplaces 0.0278606 0.0047452 5.871 5.37e-09 ***
Garage.Cars 0.0400643 0.0046975 8.529 < 2e-16 ***
Paved.DriveP 0.0101521 0.0181899 0.558 0.576854
Paved.DriveY 0.0573398 0.0114111 5.025 5.67e-07 ***
log(Pool.Area) 0.0175609 0.0059007 2.976 0.002969 **
log(TotalSq) 0.3701184 0.0154033 24.029 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.09123 on 1425 degrees of freedom
Multiple R-squared: 0.9426, Adjusted R-squared: 0.9399
F-statistic: 349 on 67 and 1425 DF, p-value: < 2.2e-16
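The formula of model.BIC came from the commented-out step() call, whose k argument sets the penalty per parameter: k = 2 corresponds to AIC and k = log(n) to BIC. A minimal sketch of the same idea on the built-in mtcars data (a stand-in for the Ames model, with an arbitrary set of predictors):

```r
# Stand-in full model
full <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)
n    <- nrow(mtcars)

# Backward elimination with the BIC penalty; trace = 0 suppresses the step log
fit_bic <- step(full, direction = "backward", k = log(n), trace = 0)

print(formula(fit_bic))
print(c(full = BIC(full), selected = BIC(fit_bic)))
```

Because log(n) exceeds 2 once n is larger than about 7, the BIC penalty is stricter than AIC and tends to keep fewer terms.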
plot(model.BIC)
not plotting observations with leverage one:
638, 655, 990
modelinteract = lm(log(price) ~ log(Lot.Area) + Condition.1 + Overall.Qual:Baths:Neighborhood + Overall.Qual:Baths + Overall.Qual:Neighborhood + Overall.Qual + Baths + Neighborhood + Overall.Qual:Baths:log(Total.Bsmt.SF) + Overall.Qual:log(Total.Bsmt.SF) + Overall.Qual:Garage.Cars:log(Total.Bsmt.SF):log(TotalSq) + Overall.Qual:Garage.Cars:log(TotalSq) + Overall.Qual:Garage.Cars + Overall.Qual:log(TotalSq) + Garage.Cars:log(Total.Bsmt.SF):log(TotalSq) + Garage.Cars:log(TotalSq) + log(Total.Bsmt.SF):log(TotalSq) + Garage.Cars + log(Total.Bsmt.SF) + log(TotalSq) + Overall.Cond + HouseAge + Foundation + Bsmt.Qual + Bsmt.Exposure + Heating.QC + Central.Air + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual + Functional + Fireplaces + Paved.Drive + log(Pool.Area), data = data_train)
model.interact.reduced=lm(log(price) ~ log(Lot.Area) + Condition.1 + Overall.Qual +
Baths + Neighborhood + Garage.Cars + log(Total.Bsmt.SF) +
log(TotalSq) + Overall.Cond + HouseAge + Foundation + Bsmt.Qual +
Bsmt.Exposure + Heating.QC + Central.Air + Bedroom.AbvGr +
Kitchen.AbvGr + Kitchen.Qual + Functional + Fireplaces +
Paved.Drive + Overall.Qual:Garage.Cars + Overall.Qual:log(TotalSq) +
Garage.Cars:log(TotalSq) + Overall.Qual:Garage.Cars:log(TotalSq),
data = data_train)
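The seven terms involving Overall.Qual, Garage.Cars and log(TotalSq) in model.interact.reduced are the same set that R's `*` operator generates for three variables. The sketch below expands such a formula with terms(); no data are needed, and the response name `y` is a placeholder.

```r
# Three-way crossing written compactly
f <- y ~ Overall.Qual * Garage.Cars * log(TotalSq)

# Expand into main effects, two-way and three-way interaction terms
labels <- attr(terms(f), "term.labels")
print(labels)
```

Writing the crossing with `*` keeps long formulas readable, while the explicit `:` terms used above make it easier to drop individual interactions.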
summary(model.interact.reduced)
Call:
lm(formula = log(price) ~ log(Lot.Area) + Condition.1 + Overall.Qual +
Baths + Neighborhood + Garage.Cars + log(Total.Bsmt.SF) +
log(TotalSq) + Overall.Cond + HouseAge + Foundation + Bsmt.Qual +
Bsmt.Exposure + Heating.QC + Central.Air + Bedroom.AbvGr +
Kitchen.AbvGr + Kitchen.Qual + Functional + Fireplaces +
Paved.Drive + Overall.Qual:Garage.Cars + Overall.Qual:log(TotalSq) +
Garage.Cars:log(TotalSq) + Overall.Qual:Garage.Cars:log(TotalSq),
data = data_train)
Residuals:
Min 1Q Median 3Q Max
-0.66963 -0.05131 0.00028 0.05415 0.31533
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.5102614 0.4057490 16.045 < 2e-16 ***
log(Lot.Area) 0.0987836 0.0075415 13.099 < 2e-16 ***
Condition.1Artery -0.0736717 0.0150310 -4.901 1.06e-06 ***
Condition.1Feedr -0.0745876 0.0112824 -6.611 5.39e-11 ***
Condition.1Park -0.0051531 0.0181031 -0.285 0.775950
Condition.1Rail -0.0456100 0.0148189 -3.078 0.002125 **
Overall.Qual 0.2068733 0.0752569 2.749 0.006055 **
Baths 0.0468993 0.0045396 10.331 < 2e-16 ***
NeighborhoodBlueste 0.0203947 0.0416095 0.490 0.624106
NeighborhoodBrDale -0.0722602 0.0345132 -2.094 0.036463 *
NeighborhoodBrkSide -0.0217135 0.0304547 -0.713 0.475977
NeighborhoodClearCr 0.0103092 0.0336539 0.306 0.759399
NeighborhoodCollgCr -0.0159676 0.0273774 -0.583 0.559823
NeighborhoodCrawfor 0.0499783 0.0308951 1.618 0.105954
NeighborhoodEdwards -0.1032780 0.0291913 -3.538 0.000416 ***
NeighborhoodGilbert -0.0368491 0.0285011 -1.293 0.196256
NeighborhoodGreens 0.0574166 0.0441283 1.301 0.193425
NeighborhoodGrnHill 0.4438689 0.0705238 6.294 4.11e-10 ***
NeighborhoodIDOTRR -0.1470244 0.0322360 -4.561 5.53e-06 ***
NeighborhoodLandmrk -0.0700365 0.0941311 -0.744 0.456981
NeighborhoodMeadowV -0.1441208 0.0380517 -3.787 0.000159 ***
NeighborhoodMitchel -0.0346045 0.0293105 -1.181 0.237952
NeighborhoodNAmes -0.0586303 0.0283935 -2.065 0.039111 *
NeighborhoodNoRidge 0.0250677 0.0307515 0.815 0.415112
NeighborhoodNPkVill -0.0218909 0.0381536 -0.574 0.566222
NeighborhoodNridgHt 0.0317326 0.0289648 1.096 0.273459
NeighborhoodNWAmes -0.0465829 0.0295620 -1.576 0.115302
NeighborhoodOldTown -0.1206424 0.0292366 -4.126 3.90e-05 ***
NeighborhoodSawyer -0.0277534 0.0297174 -0.934 0.350509
NeighborhoodSawyerW -0.0496333 0.0287674 -1.725 0.084685 .
NeighborhoodSomerst 0.0576101 0.0274356 2.100 0.035919 *
NeighborhoodStoneBr 0.0743440 0.0318815 2.332 0.019846 *
NeighborhoodSWISU -0.0658180 0.0339123 -1.941 0.052476 .
NeighborhoodTimber -0.0245197 0.0315509 -0.777 0.437201
NeighborhoodVeenker -0.0060078 0.0372772 -0.161 0.871985
Garage.Cars 1.2032426 0.2341509 5.139 3.15e-07 ***
log(Total.Bsmt.SF) 0.1189555 0.0091090 13.059 < 2e-16 ***
log(TotalSq) 0.5012898 0.0581418 8.622 < 2e-16 ***
Overall.Cond 0.0336221 0.0027929 12.039 < 2e-16 ***
HouseAge -0.0009157 0.0001870 -4.896 1.09e-06 ***
FoundationCBlock 0.0589952 0.0104102 5.667 1.76e-08 ***
FoundationPConc 0.0731980 0.0114720 6.381 2.38e-10 ***
FoundationSlab 0.0718932 0.0289054 2.487 0.012990 *
FoundationStone 0.0044701 0.0389376 0.115 0.908619
FoundationWood 0.0392228 0.0550548 0.712 0.476314
Bsmt.QualEx -0.7632015 0.1127644 -6.768 1.90e-11 ***
Bsmt.QualFa -0.8262069 0.1114845 -7.411 2.14e-13 ***
Bsmt.QualGd -0.8267350 0.1113854 -7.422 1.97e-13 ***
Bsmt.QualPo -0.8358514 0.1430638 -5.843 6.36e-09 ***
Bsmt.QualTA -0.8376454 0.1112764 -7.528 9.13e-14 ***
Bsmt.ExposureAv 0.1575361 0.0910068 1.731 0.083662 .
Bsmt.ExposureGd 0.1836363 0.0913334 2.011 0.044555 *
Bsmt.ExposureMn 0.1122697 0.0912138 1.231 0.218586
Bsmt.ExposureNo 0.1280950 0.0909443 1.408 0.159202
Heating.QC 0.0093065 0.0033440 2.783 0.005457 **
Central.AirY 0.0636338 0.0121613 5.232 1.92e-07 ***
Bedroom.AbvGr -0.0147371 0.0042642 -3.456 0.000564 ***
Kitchen.AbvGr -0.0910776 0.0142064 -6.411 1.96e-10 ***
Kitchen.Qual 0.0332874 0.0055725 5.973 2.93e-09 ***
FunctionalMaj2 -0.1605476 0.0514362 -3.121 0.001837 **
FunctionalMin1 0.0184235 0.0343702 0.536 0.592021
FunctionalMin2 0.0407226 0.0339893 1.198 0.231077
FunctionalMod 0.0029487 0.0370700 0.080 0.936611
FunctionalTyp 0.0888654 0.0310927 2.858 0.004324 **
Fireplaces 0.0296062 0.0047109 6.285 4.36e-10 ***
Paved.DriveP 0.0171341 0.0181069 0.946 0.344168
Paved.DriveY 0.0605392 0.0114220 5.300 1.34e-07 ***
Overall.Qual:Garage.Cars -0.1928486 0.0384268 -5.019 5.86e-07 ***
Overall.Qual:log(TotalSq) -0.0225185 0.0105342 -2.138 0.032715 *
Garage.Cars:log(TotalSq) -0.1633273 0.0325288 -5.021 5.79e-07 ***
Overall.Qual:Garage.Cars:log(TotalSq) 0.0269148 0.0052360 5.140 3.12e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.09038 on 1422 degrees of freedom
Multiple R-squared: 0.9437, Adjusted R-squared: 0.941
F-statistic: 340.7 on 70 and 1422 DF, p-value: < 2.2e-16
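The adjusted R-squared and the F-statistic reported above both follow from the multiple R-squared and the two degrees-of-freedom values. The sketch below recomputes them from the printed (rounded) numbers as a sanity check; the variable names are illustrative, and small differences from the printed output come from rounding of R-squared.

```r
# Values copied from the summary output above.
r2 <- 0.9437        # multiple R-squared
p <- 70             # numerator df of the F-statistic (model terms)
df_resid <- 1422    # residual degrees of freedom
n <- df_resid + p + 1  # number of observations used in the fit

# Adjusted R-squared penalises R-squared for the number of fitted terms.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic expressed through R-squared.
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

print(n)
print(round(adj_r2, 3))
print(round(f_stat, 1))
```

The observation count recovered this way matches the 1493 houses at the root of the regression tree fitted on the same training data.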
plot(model.interact.reduced)
not plotting observations with leverage one:
638, 655, 990
not plotting observations with leverage one:
638, 655, 990
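Because the response is log(price), the coefficients are easier to communicate as percentage changes in price. The sketch below converts a few coefficients copied from the summary of model.interact.reduced; the names `coefs`, `pct_change`, `b_lot` and `pct_lot` are illustrative. It only covers predictors that do not appear in an interaction term, since the effect of Overall.Qual, Garage.Cars or log(TotalSq) depends on the values of the other two.

```r
# Coefficients copied from the summary(model.interact.reduced) output above.
coefs <- c(Central.AirY = 0.0636338, Fireplaces = 0.0296062)

# For an untransformed predictor, a one-unit increase multiplies the expected
# price by exp(b), holding the other predictors fixed.
pct_change <- 100 * (exp(coefs) - 1)
print(round(pct_change, 2))

# For a log-transformed predictor the coefficient acts as an elasticity:
# a 10% larger lot multiplies the expected price by 1.10^b.
b_lot <- 0.0987836
pct_lot <- 100 * (1.10^b_lot - 1)
print(round(pct_lot, 2))
```

Read this way, central air is associated with a price roughly 6.6% higher, each fireplace with about 3%, and a 10% larger lot with a bit under 1%, all else equal.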
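# Backward selection with k = log(n), i.e. the BIC penalty; step() still labels the criterion "AIC"
step(modelinteract, k=log(nrow(data_train)))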
Start: AIC=-6393.34
log(price) ~ log(Lot.Area) + Condition.1 + Overall.Qual:Baths:Neighborhood +
Overall.Qual:Baths + Overall.Qual:Neighborhood + Overall.Qual +
Baths + Neighborhood + Overall.Qual:Baths:log(Total.Bsmt.SF) +
Overall.Qual:log(Total.Bsmt.SF) + Overall.Qual:Garage.Cars:log(Total.Bsmt.SF):log(TotalSq) +
Overall.Qual:Garage.Cars:log(TotalSq) + Overall.Qual:Garage.Cars +
Overall.Qual:log(TotalSq) + Garage.Cars:log(Total.Bsmt.SF):log(TotalSq) +
Garage.Cars:log(TotalSq) + log(Total.Bsmt.SF):log(TotalSq) +
Garage.Cars + log(Total.Bsmt.SF) + log(TotalSq) + Overall.Cond +
HouseAge + Foundation + Bsmt.Qual + Bsmt.Exposure + Heating.QC +
Central.Air + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual +
Functional + Fireplaces + Paved.Drive + log(Pool.Area)
Df Sum of Sq RSS AIC
- Overall.Qual:Baths:Neighborhood 25 0.27651 11.351 -6539.2
- Overall.Qual:log(Total.Bsmt.SF):Garage.Cars:log(TotalSq) 1 0.00141 11.076 -6400.5
- Overall.Qual:Baths:log(Total.Bsmt.SF) 1 0.00289 11.077 -6400.3
<none> 11.075 -6393.3
- log(Pool.Area) 1 0.05840 11.133 -6392.8
- Heating.QC 1 0.07333 11.148 -6390.8
- Bedroom.AbvGr 1 0.07357 11.148 -6390.8
- Foundation 5 0.32271 11.397 -6387.0
- Bsmt.Exposure 4 0.32453 11.399 -6379.4
- Paved.Drive 2 0.25326 11.328 -6374.2
- Central.Air 1 0.20054 11.275 -6373.8
- HouseAge 1 0.21102 11.286 -6372.5
- Kitchen.Qual 1 0.24251 11.317 -6368.3
- Bsmt.Qual 5 0.54381 11.618 -6358.3
- Condition.1 4 0.48963 11.564 -6358.0
- Kitchen.AbvGr 1 0.33164 11.406 -6356.6
- Fireplaces 1 0.37225 11.447 -6351.3
- Functional 5 0.62194 11.696 -6348.3
- Overall.Cond 1 1.06255 12.137 -6263.9
- log(Lot.Area) 1 1.18813 12.263 -6248.5
Step: AIC=-6539.23
log(price) ~ log(Lot.Area) + Condition.1 + Overall.Qual + Baths +
Neighborhood + Garage.Cars + log(Total.Bsmt.SF) + log(TotalSq) +
Overall.Cond + HouseAge + Foundation + Bsmt.Qual + Bsmt.Exposure +
Heating.QC + Central.Air + Bedroom.AbvGr + Kitchen.AbvGr +
Kitchen.Qual + Functional + Fireplaces + Paved.Drive + log(Pool.Area) +
Overall.Qual:Baths + Overall.Qual:Neighborhood + Overall.Qual:log(Total.Bsmt.SF) +
Overall.Qual:Garage.Cars + Overall.Qual:log(TotalSq) + Garage.Cars:log(TotalSq) +
log(Total.Bsmt.SF):log(TotalSq) + Overall.Qual:Baths:log(Total.Bsmt.SF) +
Overall.Qual:Garage.Cars:log(TotalSq) + log(Total.Bsmt.SF):Garage.Cars:log(TotalSq) +
Overall.Qual:log(Total.Bsmt.SF):Garage.Cars:log(TotalSq)
Df Sum of Sq RSS AIC
- Overall.Qual:Neighborhood 24 0.19443 11.545 -6689.3
- Overall.Qual:Garage.Cars:log(Total.Bsmt.SF):log(TotalSq) 1 0.00006 11.351 -6546.5
- Overall.Qual:Baths:log(Total.Bsmt.SF) 1 0.00263 11.354 -6546.2
- log(Pool.Area) 1 0.04903 11.400 -6540.1
<none> 11.351 -6539.2
- Heating.QC 1 0.07445 11.425 -6536.8
- Bedroom.AbvGr 1 0.09175 11.443 -6534.5
- Foundation 5 0.34270 11.694 -6531.4
- Bsmt.Exposure 4 0.34383 11.695 -6523.9
- Central.Air 1 0.18322 11.534 -6522.6
- HouseAge 1 0.18603 11.537 -6522.3
- Paved.Drive 2 0.26057 11.612 -6520.0
- Kitchen.Qual 1 0.24544 11.596 -6514.6
- Fireplaces 1 0.33425 11.685 -6503.2
- Bsmt.Qual 5 0.56586 11.917 -6503.1
- Kitchen.AbvGr 1 0.35655 11.707 -6500.4
- Condition.1 4 0.56029 11.911 -6496.5
- Functional 5 0.62699 11.978 -6495.5
- Overall.Cond 1 1.15804 12.509 -6401.5
- log(Lot.Area) 1 1.31673 12.668 -6382.7
Step: AIC=-6689.28
log(price) ~ log(Lot.Area) + Condition.1 + Overall.Qual + Baths +
Neighborhood + Garage.Cars + log(Total.Bsmt.SF) + log(TotalSq) +
Overall.Cond + HouseAge + Foundation + Bsmt.Qual + Bsmt.Exposure +
Heating.QC + Central.Air + Bedroom.AbvGr + Kitchen.AbvGr +
Kitchen.Qual + Functional + Fireplaces + Paved.Drive + log(Pool.Area) +
Overall.Qual:Baths + Overall.Qual:log(Total.Bsmt.SF) + Overall.Qual:Garage.Cars +
Overall.Qual:log(TotalSq) + Garage.Cars:log(TotalSq) + log(Total.Bsmt.SF):log(TotalSq) +
Overall.Qual:Baths:log(Total.Bsmt.SF) + Overall.Qual:Garage.Cars:log(TotalSq) +
Garage.Cars:log(Total.Bsmt.SF):log(TotalSq) + Overall.Qual:Garage.Cars:log(Total.Bsmt.SF):log(TotalSq)
Df Sum of Sq RSS AIC
- Overall.Qual:Garage.Cars:log(Total.Bsmt.SF):log(TotalSq) 1 0.00034 11.546 -6696.5
- Overall.Qual:Baths:log(Total.Bsmt.SF) 1 0.00121 11.547 -6696.4
- log(Pool.Area) 1 0.05614 11.602 -6689.3
<none> 11.545 -6689.3
- Heating.QC 1 0.06801 11.613 -6687.8
- Bedroom.AbvGr 1 0.09581 11.641 -6684.2
- Foundation 5 0.37587 11.921 -6678.0
- HouseAge 1 0.19414 11.740 -6671.7
- Central.Air 1 0.19723 11.743 -6671.3
- Bsmt.Exposure 4 0.37270 11.918 -6671.1
- Paved.Drive 2 0.26302 11.808 -6670.3
- Kitchen.Qual 1 0.27526 11.821 -6661.4
- Fireplaces 1 0.30209 11.848 -6658.0
- Kitchen.AbvGr 1 0.33387 11.879 -6654.0
- Bsmt.Qual 5 0.63863 12.184 -6645.4
- Functional 5 0.65217 12.198 -6643.8
- Condition.1 4 0.59422 12.140 -6643.6
- Neighborhood 27 2.65674 14.202 -6577.4
- Overall.Cond 1 1.17879 12.724 -6551.4
- log(Lot.Area) 1 1.36912 12.915 -6529.3
Step: AIC=-6696.54
log(price) ~ log(Lot.Area) + Condition.1 + Overall.Qual + Baths +
Neighborhood + Garage.Cars + log(Total.Bsmt.SF) + log(TotalSq) +
Overall.Cond + HouseAge + Foundation + Bsmt.Qual + Bsmt.Exposure +
Heating.QC + Central.Air + Bedroom.AbvGr + Kitchen.AbvGr +
Kitchen.Qual + Functional + Fireplaces + Paved.Drive + log(Pool.Area) +
Overall.Qual:Baths + Overall.Qual:log(Total.Bsmt.SF) + Overall.Qual:Garage.Cars +
Overall.Qual:log(TotalSq) + Garage.Cars:log(TotalSq) + log(Total.Bsmt.SF):log(TotalSq) +
Overall.Qual:Baths:log(Total.Bsmt.SF) + Overall.Qual:Garage.Cars:log(TotalSq) +
Garage.Cars:log(Total.Bsmt.SF):log(TotalSq)
Df Sum of Sq RSS AIC
- Overall.Qual:Baths:log(Total.Bsmt.SF) 1 0.00191 11.548 -6703.6
- Garage.Cars:log(Total.Bsmt.SF):log(TotalSq) 1 0.00473 11.550 -6703.2
- log(Pool.Area) 1 0.05614 11.602 -6696.6
<none> 11.546 -6696.5
- Heating.QC 1 0.06810 11.614 -6695.1
- Bedroom.AbvGr 1 0.09669 11.643 -6691.4
- Foundation 5 0.37556 11.921 -6685.3
- Overall.Qual:Garage.Cars:log(TotalSq) 1 0.15669 11.702 -6683.7
- HouseAge 1 0.19387 11.740 -6679.0
- Central.Air 1 0.19691 11.743 -6678.6
- Bsmt.Exposure 4 0.37501 11.921 -6678.1
- Paved.Drive 2 0.26427 11.810 -6677.4
- Kitchen.Qual 1 0.27605 11.822 -6668.6
- Fireplaces 1 0.30183 11.848 -6665.3
- Kitchen.AbvGr 1 0.33539 11.881 -6661.1
- Bsmt.Qual 5 0.64444 12.190 -6652.0
- Functional 5 0.65188 12.198 -6651.1
- Condition.1 4 0.59396 12.140 -6650.9
- Neighborhood 27 2.65682 14.203 -6584.7
- Overall.Cond 1 1.18054 12.726 -6558.5
- log(Lot.Area) 1 1.36879 12.915 -6536.6
Step: AIC=-6703.6
log(price) ~ log(Lot.Area) + Condition.1 + Overall.Qual + Baths +
Neighborhood + Garage.Cars + log(Total.Bsmt.SF) + log(TotalSq) +
Overall.Cond + HouseAge + Foundation + Bsmt.Qual + Bsmt.Exposure +
Heating.QC + Central.Air + Bedroom.AbvGr + Kitchen.AbvGr +
Kitchen.Qual + Functional + Fireplaces + Paved.Drive + log(Pool.Area) +
Overall.Qual:Baths + Overall.Qual:log(Total.Bsmt.SF) + Overall.Qual:Garage.Cars +
Overall.Qual:log(TotalSq) + Garage.Cars:log(TotalSq) + log(Total.Bsmt.SF):log(TotalSq) +
Overall.Qual:Garage.Cars:log(TotalSq) + Garage.Cars:log(Total.Bsmt.SF):log(TotalSq)
Df Sum of Sq RSS AIC
- Overall.Qual:Baths 1 0.00009 11.548 -6710.9
- Garage.Cars:log(Total.Bsmt.SF):log(TotalSq) 1 0.00435 11.552 -6710.4
- Overall.Qual:log(Total.Bsmt.SF) 1 0.00448 11.552 -6710.3
- log(Pool.Area) 1 0.05662 11.604 -6703.6
<none> 11.548 -6703.6
- Heating.QC 1 0.06780 11.616 -6702.2
- Bedroom.AbvGr 1 0.09749 11.645 -6698.4
- Foundation 5 0.37443 11.922 -6692.5
- Overall.Qual:Garage.Cars:log(TotalSq) 1 0.16502 11.713 -6689.7
- HouseAge 1 0.19304 11.741 -6686.2
- Central.Air 1 0.19873 11.746 -6685.4
- Paved.Drive 2 0.26390 11.812 -6684.5
- Bsmt.Exposure 4 0.38115 11.929 -6684.4
- Kitchen.Qual 1 0.27478 11.822 -6675.8
- Fireplaces 1 0.30256 11.850 -6672.3
- Kitchen.AbvGr 1 0.34713 11.895 -6666.7
- Functional 5 0.65003 12.198 -6658.4
- Condition.1 4 0.59210 12.140 -6658.2
- Bsmt.Qual 5 0.67344 12.221 -6655.5
- Neighborhood 27 2.65828 14.206 -6591.6
- Overall.Cond 1 1.20245 12.750 -6563.0
- log(Lot.Area) 1 1.38904 12.937 -6541.3
Step: AIC=-6710.9
log(price) ~ log(Lot.Area) + Condition.1 + Overall.Qual + Baths +
Neighborhood + Garage.Cars + log(Total.Bsmt.SF) + log(TotalSq) +
Overall.Cond + HouseAge + Foundation + Bsmt.Qual + Bsmt.Exposure +
Heating.QC + Central.Air + Bedroom.AbvGr + Kitchen.AbvGr +
Kitchen.Qual + Functional + Fireplaces + Paved.Drive + log(Pool.Area) +
Overall.Qual:log(Total.Bsmt.SF) + Overall.Qual:Garage.Cars +
Overall.Qual:log(TotalSq) + Garage.Cars:log(TotalSq) + log(Total.Bsmt.SF):log(TotalSq) +
Overall.Qual:Garage.Cars:log(TotalSq) + Garage.Cars:log(Total.Bsmt.SF):log(TotalSq)
Df Sum of Sq RSS AIC
- Garage.Cars:log(Total.Bsmt.SF):log(TotalSq) 1 0.00429 11.552 -6717.7
- Overall.Qual:log(Total.Bsmt.SF) 1 0.00457 11.552 -6717.6
- log(Pool.Area) 1 0.05665 11.604 -6710.9
<none> 11.548 -6710.9
- Heating.QC 1 0.06844 11.616 -6709.4
- Bedroom.AbvGr 1 0.09743 11.645 -6705.7
- Foundation 5 0.37460 11.922 -6699.8
- Overall.Qual:Garage.Cars:log(TotalSq) 1 0.17457 11.722 -6695.8
- HouseAge 1 0.19344 11.741 -6693.4
- Central.Air 1 0.20024 11.748 -6692.5
- Paved.Drive 2 0.26383 11.812 -6691.8
- Bsmt.Exposure 4 0.38132 11.929 -6691.6
- Kitchen.Qual 1 0.27537 11.823 -6683.0
- Fireplaces 1 0.30385 11.852 -6679.4
- Kitchen.AbvGr 1 0.34733 11.895 -6674.0
- Functional 5 0.65081 12.199 -6665.6
- Condition.1 4 0.59303 12.141 -6665.4
- Bsmt.Qual 5 0.67508 12.223 -6662.6
- Baths 1 0.84956 12.397 -6612.2
- Neighborhood 27 2.65943 14.207 -6598.8
- Overall.Cond 1 1.20663 12.754 -6569.8
- log(Lot.Area) 1 1.38984 12.938 -6548.5
Step: AIC=-6717.66
log(price) ~ log(Lot.Area) + Condition.1 + Overall.Qual + Baths +
Neighborhood + Garage.Cars + log(Total.Bsmt.SF) + log(TotalSq) +
Overall.Cond + HouseAge + Foundation + Bsmt.Qual + Bsmt.Exposure +
Heating.QC + Central.Air + Bedroom.AbvGr + Kitchen.AbvGr +
Kitchen.Qual + Functional + Fireplaces + Paved.Drive + log(Pool.Area) +
Overall.Qual:log(Total.Bsmt.SF) + Overall.Qual:Garage.Cars +
Overall.Qual:log(TotalSq) + Garage.Cars:log(TotalSq) + log(Total.Bsmt.SF):log(TotalSq) +
Overall.Qual:Garage.Cars:log(TotalSq)
Df Sum of Sq RSS AIC
- Overall.Qual:log(Total.Bsmt.SF) 1 0.00482 11.557 -6724.3
- log(Total.Bsmt.SF):log(TotalSq) 1 0.00739 11.559 -6724.0
- log(Pool.Area) 1 0.05656 11.609 -6717.7
<none> 11.552 -6717.7
- Heating.QC 1 0.06773 11.620 -6716.2
- Bedroom.AbvGr 1 0.09473 11.647 -6712.8
- Foundation 5 0.37508 11.927 -6706.5
- Overall.Qual:Garage.Cars:log(TotalSq) 1 0.17252 11.725 -6702.8
- HouseAge 1 0.19341 11.745 -6700.2
- Central.Air 1 0.20225 11.754 -6699.1
- Paved.Drive 2 0.26022 11.812 -6699.0
- Bsmt.Exposure 4 0.38110 11.933 -6698.4
- Kitchen.Qual 1 0.27265 11.825 -6690.1
- Fireplaces 1 0.30489 11.857 -6686.1
- Kitchen.AbvGr 1 0.34493 11.897 -6681.0
- Functional 5 0.64652 12.199 -6672.9
- Condition.1 4 0.59558 12.148 -6671.8
- Bsmt.Qual 5 0.68287 12.235 -6668.5
- Baths 1 0.85824 12.410 -6618.0
- Neighborhood 27 2.65589 14.208 -6606.0
- Overall.Cond 1 1.21843 12.771 -6575.3
- log(Lot.Area) 1 1.38629 12.938 -6555.8
Step: AIC=-6724.34
log(price) ~ log(Lot.Area) + Condition.1 + Overall.Qual + Baths +
Neighborhood + Garage.Cars + log(Total.Bsmt.SF) + log(TotalSq) +
Overall.Cond + HouseAge + Foundation + Bsmt.Qual + Bsmt.Exposure +
Heating.QC + Central.Air + Bedroom.AbvGr + Kitchen.AbvGr +
Kitchen.Qual + Functional + Fireplaces + Paved.Drive + log(Pool.Area) +
Overall.Qual:Garage.Cars + Overall.Qual:log(TotalSq) + Garage.Cars:log(TotalSq) +
log(Total.Bsmt.SF):log(TotalSq) + Overall.Qual:Garage.Cars:log(TotalSq)
Df Sum of Sq RSS AIC
- log(Total.Bsmt.SF):log(TotalSq) 1 0.00423 11.561 -6731.1
- log(Pool.Area) 1 0.05580 11.613 -6724.5
<none> 11.557 -6724.3
- Heating.QC 1 0.06690 11.624 -6723.0
- Bedroom.AbvGr 1 0.09771 11.655 -6719.1
- Foundation 5 0.37544 11.932 -6713.2
- Overall.Qual:Garage.Cars:log(TotalSq) 1 0.16899 11.726 -6710.0
- HouseAge 1 0.19173 11.749 -6707.1
- Paved.Drive 2 0.25664 11.813 -6706.2
- Central.Air 1 0.20052 11.757 -6706.0
- Bsmt.Exposure 4 0.38280 11.940 -6704.9
- Kitchen.Qual 1 0.27739 11.834 -6696.2
- Fireplaces 1 0.30403 11.861 -6692.9
- Kitchen.AbvGr 1 0.34163 11.899 -6688.2
- Functional 5 0.64326 12.200 -6680.0
- Condition.1 4 0.59232 12.149 -6679.0
- Bsmt.Qual 5 0.73185 12.289 -6669.2
- Baths 1 0.85917 12.416 -6624.6
- Neighborhood 27 2.65109 14.208 -6613.3
- Overall.Cond 1 1.21771 12.775 -6582.1
- log(Lot.Area) 1 1.39657 12.953 -6561.3
Step: AIC=-6731.1
log(price) ~ log(Lot.Area) + Condition.1 + Overall.Qual + Baths +
Neighborhood + Garage.Cars + log(Total.Bsmt.SF) + log(TotalSq) +
Overall.Cond + HouseAge + Foundation + Bsmt.Qual + Bsmt.Exposure +
Heating.QC + Central.Air + Bedroom.AbvGr + Kitchen.AbvGr +
Kitchen.Qual + Functional + Fireplaces + Paved.Drive + log(Pool.Area) +
Overall.Qual:Garage.Cars + Overall.Qual:log(TotalSq) + Garage.Cars:log(TotalSq) +
Overall.Qual:Garage.Cars:log(TotalSq)
Df Sum of Sq RSS AIC
- log(Pool.Area) 1 0.05569 11.617 -6731.2
<none> 11.561 -6731.1
- Heating.QC 1 0.06618 11.627 -6729.9
- Bedroom.AbvGr 1 0.09745 11.659 -6725.9
- Foundation 5 0.37157 11.933 -6720.4
- HouseAge 1 0.19214 11.753 -6713.8
- Paved.Drive 2 0.26198 11.823 -6712.3
- Bsmt.Exposure 4 0.38138 11.943 -6711.9
- Overall.Qual:Garage.Cars:log(TotalSq) 1 0.20815 11.769 -6711.8
- Central.Air 1 0.21974 11.781 -6710.3
- Kitchen.Qual 1 0.27785 11.839 -6703.0
- Fireplaces 1 0.30489 11.866 -6699.5
- Kitchen.AbvGr 1 0.33922 11.900 -6695.2
- Functional 5 0.63914 12.200 -6687.3
- Condition.1 4 0.59007 12.151 -6686.0
- Bsmt.Qual 5 0.72871 12.290 -6676.4
- Baths 1 0.85871 12.420 -6631.4
- Neighborhood 27 2.64847 14.210 -6620.5
- Overall.Cond 1 1.21550 12.777 -6589.2
- log(Lot.Area) 1 1.40140 12.963 -6567.6
- log(Total.Bsmt.SF) 1 1.41469 12.976 -6566.1
Step: AIC=-6731.24
log(price) ~ log(Lot.Area) + Condition.1 + Overall.Qual + Baths +
Neighborhood + Garage.Cars + log(Total.Bsmt.SF) + log(TotalSq) +
Overall.Cond + HouseAge + Foundation + Bsmt.Qual + Bsmt.Exposure +
Heating.QC + Central.Air + Bedroom.AbvGr + Kitchen.AbvGr +
Kitchen.Qual + Functional + Fireplaces + Paved.Drive + Overall.Qual:Garage.Cars +
Overall.Qual:log(TotalSq) + Garage.Cars:log(TotalSq) + Overall.Qual:Garage.Cars:log(TotalSq)
Df Sum of Sq RSS AIC
<none> 11.617 -6731.2
- Heating.QC 1 0.06327 11.680 -6730.4
- Bedroom.AbvGr 1 0.09757 11.714 -6726.1
- Foundation 5 0.37226 11.989 -6720.7
- HouseAge 1 0.19580 11.813 -6713.6
- Paved.Drive 2 0.25832 11.875 -6713.0
- Bsmt.Exposure 4 0.38607 12.003 -6711.7
- Overall.Qual:Garage.Cars:log(TotalSq) 1 0.21586 11.833 -6711.1
- Central.Air 1 0.22367 11.841 -6710.1
- Kitchen.Qual 1 0.29150 11.908 -6701.5
- Fireplaces 1 0.32266 11.939 -6697.6
- Kitchen.AbvGr 1 0.33577 11.953 -6696.0
- Functional 5 0.62605 12.243 -6689.4
- Condition.1 4 0.57680 12.194 -6688.1
- Bsmt.Qual 5 0.71459 12.331 -6678.7
- Baths 1 0.87194 12.489 -6630.5
- Neighborhood 27 2.63950 14.256 -6622.9
- Overall.Cond 1 1.18396 12.801 -6593.6
- log(Total.Bsmt.SF) 1 1.39321 13.010 -6569.4
- log(Lot.Area) 1 1.40167 13.018 -6568.5
Call:
lm(formula = log(price) ~ log(Lot.Area) + Condition.1 + Overall.Qual +
Baths + Neighborhood + Garage.Cars + log(Total.Bsmt.SF) +
log(TotalSq) + Overall.Cond + HouseAge + Foundation + Bsmt.Qual +
Bsmt.Exposure + Heating.QC + Central.Air + Bedroom.AbvGr +
Kitchen.AbvGr + Kitchen.Qual + Functional + Fireplaces +
Paved.Drive + Overall.Qual:Garage.Cars + Overall.Qual:log(TotalSq) +
Garage.Cars:log(TotalSq) + Overall.Qual:Garage.Cars:log(TotalSq),
data = data_train)
Coefficients:
(Intercept) log(Lot.Area)
6.5102614 0.0987836
Condition.1Artery Condition.1Feedr
-0.0736717 -0.0745876
Condition.1Park Condition.1Rail
-0.0051531 -0.0456100
Overall.Qual Baths
0.2068733 0.0468993
NeighborhoodBlueste NeighborhoodBrDale
0.0203947 -0.0722602
NeighborhoodBrkSide NeighborhoodClearCr
-0.0217135 0.0103092
NeighborhoodCollgCr NeighborhoodCrawfor
-0.0159676 0.0499783
NeighborhoodEdwards NeighborhoodGilbert
-0.1032780 -0.0368491
NeighborhoodGreens NeighborhoodGrnHill
0.0574166 0.4438689
NeighborhoodIDOTRR NeighborhoodLandmrk
-0.1470244 -0.0700365
NeighborhoodMeadowV NeighborhoodMitchel
-0.1441208 -0.0346045
NeighborhoodNAmes NeighborhoodNoRidge
-0.0586303 0.0250677
NeighborhoodNPkVill NeighborhoodNridgHt
-0.0218909 0.0317326
NeighborhoodNWAmes NeighborhoodOldTown
-0.0465829 -0.1206424
NeighborhoodSawyer NeighborhoodSawyerW
-0.0277534 -0.0496333
NeighborhoodSomerst NeighborhoodStoneBr
0.0576101 0.0743440
NeighborhoodSWISU NeighborhoodTimber
-0.0658180 -0.0245197
NeighborhoodVeenker Garage.Cars
-0.0060078 1.2032426
log(Total.Bsmt.SF) log(TotalSq)
0.1189555 0.5012898
Overall.Cond HouseAge
0.0336221 -0.0009157
FoundationCBlock FoundationPConc
0.0589952 0.0731980
FoundationSlab FoundationStone
0.0718932 0.0044701
FoundationWood Bsmt.QualEx
0.0392228 -0.7632015
Bsmt.QualFa Bsmt.QualGd
-0.8262069 -0.8267350
Bsmt.QualPo Bsmt.QualTA
-0.8358514 -0.8376454
Bsmt.ExposureAv Bsmt.ExposureGd
0.1575361 0.1836363
Bsmt.ExposureMn Bsmt.ExposureNo
0.1122697 0.1280950
Heating.QC Central.AirY
0.0093065 0.0636338
Bedroom.AbvGr Kitchen.AbvGr
-0.0147371 -0.0910776
Kitchen.Qual FunctionalMaj2
0.0332874 -0.1605476
FunctionalMin1 FunctionalMin2
0.0184235 0.0407226
FunctionalMod FunctionalTyp
0.0029487 0.0888654
Fireplaces Paved.DriveP
0.0296062 0.0171341
Paved.DriveY Overall.Qual:Garage.Cars
0.0605392 -0.1928486
Overall.Qual:log(TotalSq) Garage.Cars:log(TotalSq)
-0.0225185 -0.1633273
Overall.Qual:Garage.Cars:log(TotalSq)
0.0269148
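The column labelled AIC in the trace above is really a BIC-type criterion, because `step()` was called with `k = log(n)`. The sketch below shows the same mechanism on the built-in `mtcars` data, used here only so the example is self-contained; the model and variable names are not part of the housing analysis.

```r
# Built-in data, used only to keep the sketch self-contained.
fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
n <- nrow(mtcars)

# k = log(n) gives the BIC penalty; with no scope, step() searches backward.
bic_fit <- step(fit, k = log(n), trace = 0)

# extractAIC() returns c(edf, criterion); with k = log(n) this is the value
# that step() prints. It differs from BIC() only by an additive constant
# that is the same for every model fitted to the same data.
crit_full <- extractAIC(fit, k = log(n))[2]
crit_sel  <- extractAIC(bic_fit, k = log(n))[2]
offset_full <- BIC(fit) - crit_full
offset_sel  <- BIC(bic_fit) - crit_sel

print(formula(bic_fit))
print(c(full = crit_full, selected = crit_sel))
print(c(offset_full = offset_full, offset_sel = offset_sel))
```

Since the offset is constant, ranking models by the printed criterion is the same as ranking them by `BIC()`.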
library(tree)
tree = tree(log(price) ~ log(Lot.Area) + Neighborhood + Condition.1 +
Overall.Qual + Overall.Cond + HouseAge + Foundation + Bsmt.Qual +
Bsmt.Exposure + log(Total.Bsmt.SF) + Heating.QC + Central.Air +
Baths + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual + Functional +
Fireplaces + Garage.Cars + Paved.Drive + log(Pool.Area) +
log(TotalSq), data = data_train)
# use cross-validation to prune
cv.tree = cv.tree(tree)
plot(cv.tree$size, cv.tree$dev, type='b')
bb = cv.tree$size[which.min(cv.tree$dev)]
# prune the tree for the size that is optimal based on CV
prune.tree=prune.tree(tree, best=bb)
# plot the pruned tree
plot(prune.tree)
text(prune.tree, pretty=0)
prune.tree
node), split, n, deviance, yval
* denotes terminal node
1) root 1493 206.500 12.01
2) Overall.Qual < 6.5 977 70.660 11.83
4) Baths < 2.25 718 42.240 11.75
8) Overall.Qual < 4.5 136 12.510 11.49
16) log(Total.Bsmt.SF) < 6.61667 74 5.731 11.34 *
17) log(Total.Bsmt.SF) > 6.61667 62 3.197 11.67 *
9) Overall.Qual > 4.5 582 18.830 11.81
18) log(Total.Bsmt.SF) < 6.89315 343 9.468 11.75 *
19) log(Total.Bsmt.SF) > 6.89315 239 6.576 11.89 *
5) Baths > 2.25 259 10.630 12.05
10) Neighborhood: Blueste,BrDale,Edwards,IDOTRR,Landmrk,MeadowV,NAmes,NPkVill,OldTown,Sawyer,SWISU 103 4.687 11.92 *
11) Neighborhood: BrkSide,ClearCr,CollgCr,Crawfor,Gilbert,Mitchel,NoRidge,NridgHt,NWAmes,SawyerW,Somerst,Timber,Veenker 156 3.029 12.14 *
3) Overall.Qual > 6.5 516 43.610 12.35
6) Garage.Cars < 2.5 382 19.670 12.25
12) log(TotalSq) < 7.30149 135 3.599 12.09 *
13) log(TotalSq) > 7.30149 247 10.730 12.34
26) log(Total.Bsmt.SF) < 6.93537 111 2.919 12.20 *
27) log(Total.Bsmt.SF) > 6.93537 136 4.112 12.45 *
7) Garage.Cars > 2.5 134 9.435 12.63
14) log(Total.Bsmt.SF) < 7.44891 93 3.847 12.53 *
15) log(Total.Bsmt.SF) > 7.44891 41 2.029 12.88 *
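The printed tree can be summarised with a little arithmetic on its deviances. The sketch below uses only numbers copied from the output above (the variable names are illustrative) to estimate how much of the training deviance the first split and the full pruned tree account for, and to convert the extreme leaf values back to dollars.

```r
# Values copied from the printed tree above.
root_dev  <- 206.500             # deviance at the root (node 1)
child_dev <- c(70.660, 43.610)   # nodes 2 and 3, after the Overall.Qual split

# Share of the total deviance removed by the first split alone.
first_split_share <- (root_dev - sum(child_dev)) / root_dev
print(round(first_split_share, 3))

# Deviances of the 11 terminal nodes (those marked with * above).
leaf_dev <- c(5.731, 3.197, 9.468, 6.576, 4.687, 3.029,
              3.599, 2.919, 4.112, 3.847, 2.029)

# Training-set analogue of R-squared for the pruned tree.
tree_r2 <- 1 - sum(leaf_dev) / root_dev
print(round(tree_r2, 3))

# yval is a mean of log(price), so exp() gives a geometric-mean price.
leaf_yval <- c(lowest = 11.34, highest = 12.88)
print(round(exp(leaf_yval)))
```

On these numbers the split on Overall.Qual alone removes a bit under half of the deviance and the whole tree about three quarters, which is well below the R-squared of roughly 0.94 for the linear models on the same training data. The leaf means range from about $84,000 to about $392,000.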
summary(model.BIC)
Call:
lm(formula = log(price) ~ log(Lot.Area) + Neighborhood + Condition.1 +
Overall.Qual + Overall.Cond + HouseAge + Foundation + Bsmt.Qual +
Bsmt.Exposure + log(Total.Bsmt.SF) + Heating.QC + Central.Air +
Baths + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual + Functional +
Fireplaces + Garage.Cars + Paved.Drive + log(Pool.Area) +
log(TotalSq), data = data_train)
Residuals:
Min 1Q Median 3Q Max
-0.66831 -0.05150 0.00025 0.05629 0.32251
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.3926554 0.1169000 63.239 < 2e-16 ***
log(Lot.Area) 0.1026436 0.0075547 13.587 < 2e-16 ***
NeighborhoodBlueste 0.0146631 0.0419500 0.350 0.726738
NeighborhoodBrDale -0.0634618 0.0347412 -1.827 0.067954 .
NeighborhoodBrkSide -0.0219612 0.0306986 -0.715 0.474491
NeighborhoodClearCr 0.0028061 0.0338084 0.083 0.933862
NeighborhoodCollgCr -0.0131967 0.0275606 -0.479 0.632136
NeighborhoodCrawfor 0.0479177 0.0311068 1.540 0.123679
NeighborhoodEdwards -0.1000400 0.0294309 -3.399 0.000695 ***
NeighborhoodGilbert -0.0379098 0.0287188 -1.320 0.187036
NeighborhoodGreens 0.0359676 0.0442891 0.812 0.416864
NeighborhoodGrnHill 0.4397167 0.0709210 6.200 7.38e-10 ***
NeighborhoodIDOTRR -0.1451682 0.0324860 -4.469 8.49e-06 ***
NeighborhoodLandmrk -0.0681952 0.0949941 -0.718 0.472943
NeighborhoodMeadowV -0.1266187 0.0383000 -3.306 0.000970 ***
NeighborhoodMitchel -0.0368471 0.0295088 -1.249 0.211987
NeighborhoodNAmes -0.0565614 0.0285542 -1.981 0.047801 *
NeighborhoodNoRidge 0.0623060 0.0303150 2.055 0.040034 *
NeighborhoodNPkVill -0.0208120 0.0384833 -0.541 0.588726
NeighborhoodNridgHt 0.0509080 0.0290237 1.754 0.079643 .
NeighborhoodNWAmes -0.0544560 0.0297828 -1.828 0.067693 .
NeighborhoodOldTown -0.1188641 0.0294370 -4.038 5.68e-05 ***
NeighborhoodSawyer -0.0229563 0.0298927 -0.768 0.442639
NeighborhoodSawyerW -0.0515292 0.0289758 -1.778 0.075559 .
NeighborhoodSomerst 0.0566938 0.0276847 2.048 0.040759 *
NeighborhoodStoneBr 0.0745779 0.0321385 2.321 0.020453 *
NeighborhoodSWISU -0.0604508 0.0340779 -1.774 0.076293 .
NeighborhoodTimber -0.0324647 0.0317804 -1.022 0.307176
NeighborhoodVeenker -0.0167331 0.0376356 -0.445 0.656671
Condition.1Artery -0.0760394 0.0151294 -5.026 5.64e-07 ***
Condition.1Feedr -0.0767356 0.0113456 -6.763 1.96e-11 ***
Condition.1Park 0.0069725 0.0182004 0.383 0.701706
Condition.1Rail -0.0496206 0.0149453 -3.320 0.000922 ***
Overall.Qual 0.0494652 0.0034280 14.430 < 2e-16 ***
Overall.Cond 0.0351455 0.0027881 12.605 < 2e-16 ***
HouseAge -0.0008503 0.0001883 -4.516 6.84e-06 ***
FoundationCBlock 0.0583435 0.0105020 5.555 3.30e-08 ***
FoundationPConc 0.0718433 0.0115567 6.217 6.66e-10 ***
FoundationSlab 0.0630207 0.0291017 2.166 0.030512 *
FoundationStone 0.0038723 0.0390916 0.099 0.921107
FoundationWood 0.0465776 0.0554929 0.839 0.401418
Bsmt.QualEx -0.7570306 0.1136937 -6.659 3.94e-11 ***
Bsmt.QualFa -0.8357544 0.1124732 -7.431 1.85e-13 ***
Bsmt.QualGd -0.8407558 0.1123115 -7.486 1.24e-13 ***
Bsmt.QualPo -0.8411315 0.1443930 -5.825 7.04e-09 ***
Bsmt.QualTA -0.8503101 0.1122157 -7.577 6.31e-14 ***
Bsmt.ExposureAv 0.1590409 0.0918543 1.731 0.083588 .
Bsmt.ExposureGd 0.1929930 0.0921545 2.094 0.036415 *
Bsmt.ExposureMn 0.1151061 0.0920620 1.250 0.211392
Bsmt.ExposureNo 0.1296663 0.0917945 1.413 0.158000
log(Total.Bsmt.SF) 0.1191674 0.0091824 12.978 < 2e-16 ***
Heating.QC 0.0091548 0.0033638 2.722 0.006577 **
Central.AirY 0.0587416 0.0121548 4.833 1.49e-06 ***
Baths 0.0470497 0.0045251 10.397 < 2e-16 ***
Bedroom.AbvGr -0.0140992 0.0042615 -3.308 0.000961 ***
Kitchen.AbvGr -0.0977644 0.0140392 -6.964 5.05e-12 ***
Kitchen.Qual 0.0346078 0.0056027 6.177 8.51e-10 ***
FunctionalMaj2 -0.1661268 0.0518564 -3.204 0.001387 **
FunctionalMin1 0.0143539 0.0345741 0.415 0.678085
FunctionalMin2 0.0352368 0.0342227 1.030 0.303356
FunctionalMod -0.0059146 0.0372751 -0.159 0.873948
FunctionalTyp 0.0882264 0.0312921 2.819 0.004877 **
Fireplaces 0.0278606 0.0047452 5.871 5.37e-09 ***
Garage.Cars 0.0400643 0.0046975 8.529 < 2e-16 ***
Paved.DriveP 0.0101521 0.0181899 0.558 0.576854
Paved.DriveY 0.0573398 0.0114111 5.025 5.67e-07 ***
log(Pool.Area) 0.0175609 0.0059007 2.976 0.002969 **
log(TotalSq) 0.3701184 0.0154033 24.029 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.09123 on 1425 degrees of freedom
Multiple R-squared: 0.9426, Adjusted R-squared: 0.9399
F-statistic: 349 on 67 and 1425 DF, p-value: < 2.2e-16
plot(model.BIC)
not plotting observations with leverage one:
638, 655, 990
not plotting observations with leverage one:
638, 655, 990
# Exploring the remaining predictors' relationships to price
# plot(log(data_train$Lot.Area), log(data_train$price))
# plot((data_train$Neighborhood), log(data_train$price))
# plot((data_train$Condition.1), log(data_train$price))
# plot((data_train$Overall.Qual), log(data_train$price))
# plot((data_train$Overall.Cond), log(data_train$price))
# plot((data_train$HouseAge), log(data_train$price))
# plot((data_train$Bsmt.Qual), log(data_train$price))
# plot((data_train$Bsmt.Exposure), log(data_train$price))
# plot(log(data_train$Total.Bsmt.SF), log(data_train$price))
# plot((data_train$Heating.QC), log(data_train$price))
# plot((data_train$Central.Air), log(data_train$price))
# plot((data_train$Baths), log(data_train$price))
# plot((data_train$Bedroom.AbvGr), log(data_train$price))
# plot((data_train$Kitchen.AbvGr), log(data_train$price))
# plot((data_train$Kitchen.Qual), log(data_train$price))
# plot((data_train$Functional), log(data_train$price))
# plot((data_train$Fireplaces), log(data_train$price))
# plot((data_train$Paved.Drive), log(data_train$price))
# plot((data_train$Garage.Cars), log(data_train$price))
# plot(log(1+data_train$Pool.Area), log(data_train$price))
# plot(log(data_train$TotalSq), log(data_train$price))
termplot(model.BIC, partial.resid = TRUE, col.res = "purple", cex = 0.5,
rug = TRUE, se = TRUE, smooth = panel.smooth)
# Three observations (638, 655, 990) have leverage one; consider excluding them
hh <- hatvalues(model.BIC)
id <- which(hh == 1)
plot(hh, type = "h")
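One common cause of a hat value of exactly one is a factor level that occurs in only one row: the model then spends a whole coefficient on that row and fits it perfectly. Whether that explains observations 638, 655 and 990 is an assumption that would need checking against the training data. The sketch below reproduces the effect on a small made-up data frame; `toy`, `fit_toy` and the other names are hypothetical.

```r
# Toy data (made up): level "c" of the factor g occurs exactly once.
toy <- data.frame(
  y = c(1.0, 1.2, 0.9, 2.1, 2.0, 2.3, 5.0),
  g = factor(c("a", "a", "a", "b", "b", "b", "c"))
)
fit_toy <- lm(y ~ g, data = toy)

# Flag leverage-one points; the tolerance is a defensive choice for
# floating-point values.
hh_toy <- hatvalues(fit_toy)
lev_one <- which(abs(hh_toy - 1) < 1e-8)
print(lev_one)

# The single "c" observation is fitted exactly, so its residual is zero and
# it carries no information about the error variance.
print(residuals(fit_toy)[lev_one])

# Counting rows per level shows where such points come from.
print(table(toy$g))
```

Applying the same `table()` check to the factor predictors in the training set would show which levels, if any, are represented by a single house.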
Yhat = predict(model, newdata=data_test, interval="predict")
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor Mas.Vnr.Type has new levels
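This error means the test set contains a level of Mas.Vnr.Type that never appeared in the data used to fit the model, so there is no coefficient for it. A minimal sketch of how such levels could be detected before calling `predict()` follows; the data frames and the level names in them are made up for illustration, and leaving the affected rows as NA is only one of several possible policies.

```r
# Toy stand-ins for the training and test sets (made-up values).
train_toy <- data.frame(y = c(1, 2, 3, 4),
                        type = factor(c("None", "BrkFace", "None", "Stone")))
test_toy  <- data.frame(type = factor(c("None", "CBlock", "Stone")))

fit_toy <- lm(y ~ type, data = train_toy)

# Levels seen in the test set but not among the levels the model was fitted on.
new_levels <- setdiff(unique(as.character(test_toy$type)),
                      fit_toy$xlevels$type)
print(new_levels)

# Predict only for rows whose level is known; leave the others as NA.
ok <- !(test_toy$type %in% new_levels)
pred <- rep(NA_real_, nrow(test_toy))
pred[ok] <- predict(fit_toy,
                    newdata = droplevels(test_toy[ok, , drop = FALSE]))
print(pred)
```

Rows left as NA still need a decision, for example merging rare levels into an "Other" category in both data sets before fitting.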
sqrt(mean((exp(yhat.ridge)-data_test$price)^2))
[1] 16492.59
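The figure above is a root mean squared error in dollars: the model predicts log(price), so the predictions are exponentiated before being compared with the observed prices. The helper below wraps that calculation so it can be reused for every candidate model; `rmse_dollars` and the toy vectors are hypothetical names and values, not part of the original analysis.

```r
# Root mean squared error on the original (dollar) scale for a model that
# predicts log(price). Both arguments are plain numeric vectors.
rmse_dollars <- function(log_pred, price) {
  stopifnot(length(log_pred) == length(price))
  sqrt(mean((exp(log_pred) - price)^2))
}

# Toy check with made-up prices: the predictions are off by $1,000 and $3,000.
price_toy    <- c(150000, 200000)
log_pred_toy <- log(c(151000, 197000))
rmse_toy <- rmse_dollars(log_pred_toy, price_toy)
print(rmse_toy)
```

With the objects used above, the call would take the form `rmse_dollars(yhat.ridge, data_test$price)`.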
bart.gop = bart(x.train= dplyr::select(data_train, -price),
y.train=data_train["price"],
x.test=dplyr::select(data_test, -price),verbose=FALSE)
Error in UseMethod("select_") :
no applicable method for 'select_' applied to an object of class "c('double', 'numeric')"
library(MASS)
#Full Model
model=lm(log(price) ~ MS.SubClass + MS.Zoning + log(Lot.Frontage) + log(Lot.Area) + Street + Alley + Lot.Shape + Land.Contour + Lot.Config + Land.Slope + Neighborhood + Condition.1 + Bldg.Type + House.Style + Overall.Qual + Overall.Cond + HouseAge + Roof.Style + Roof.Matl + Exterior.1st + Exterior.2nd + Mas.Vnr.Type + log(1+Mas.Vnr.Area) + Exter.Qual + Exter.Cond + Foundation + Bsmt.YN:Bsmt.Qual + Bsmt.YN:Bsmt.Cond + Bsmt.YN:Bsmt.Exposure + Total.Bsmt.SF + Heating + Heating.QC + Central.Air + Electrical + log(X12.SF) + log(1+Low.Qual.Fin.SF) + Baths + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual + Functional + Fireplaces + Fireplace.YN:Fireplace.Qu + Garage.YN:Garage.Type + Garage.YN:Garage.Finish + Garage.Cars + Garage.YN:Garage.Cond + Garage.YN:Garage.Qual + Paved.Drive + log(1+Pool.Area) + Pool.YN:Pool.QC + Fence + Misc.Val + Mo.Sold + Yr.Sold + Sale.Type + TotalSq, data=train)
summary(model)
boxcox(model)
#BIC model selection
step(model, k=log(nrow(train)))
#step(model, k=2)
# OLD
# model=lm(log(price)~ms.subclass + MS.Zoning + log(Lot.Frontage) + log(area) + Street + Alley + Lot.Shape + Land.Contour + Lot.Config + Land.Slope + Neighborhood + Condition.1 + Bldg.Type + House.Style + Overall.Qual + Overall.Cond + Year.Built + Year.Remod.Add + Roof.Style + Exterior.1st + Exterior.2nd + Mas.Vnr.Type + log(Mas.Vnr.Area + 1) + Exter.Qual + Exter.Cond + Foundation + Bsmt.Qual + Bsmt.Cond + Bsmt.Exposure + Heating.QC + Central.Air + Electrical + log(X1st.Flr.SF) + log(X2nd.Flr.SF) + Baths + Bedroom.AbvGr + Kitchen.AbvGr + TotRms.AbvGrd + Paved.Drive + log(Wood.Deck.SF + 1) + log(Open.Porch.SF + 1) + log(Enclosed.Porch + 1) + log(X3Ssn.Porch + 1) + log(Screen.Porch + 1) + log(Misc.Val + 1) + Mo.Sold + Yr.Sold + Sale.Type + log(TotalSq), data=train)
#vif(model)
#summary(model)
#Model Selection by AIC
#step(model, k=2)
##AIC_model = lm(formula = log(price) ~ PID + area + MS.SubClass + MS.Zoning +
# Lot.Area + Lot.Shape + Utilities + Land.Slope + Neighborhood +
# Condition.1 + Bldg.Type + House.Style + Overall.Qual + Overall.Cond +
# Year.Built + Year.Remod.Add + Roof.Style + Roof.Matl + Exterior.1st +
# Exterior.2nd + Exter.Qual + Exter.Cond + Foundation + BsmtFin.SF.1 +
# BsmtFin.SF.2 + Bsmt.Unf.SF + Heating + Heating.QC + Central.Air +
# Electrical + X1st.Flr.SF + X2nd.Flr.SF + Bsmt.Full.Bath +
# Kitchen.Qual + Functional + Garage.Cars + Garage.Area + Paved.Drive +
# Wood.Deck.SF + Open.Porch.SF + Enclosed.Porch + Screen.Porch +
# Mo.Sold + Yr.Sold + has.fence + has.fireplace, data = train)
#summary(AIC_model)
#Model Selection by BIC
#BIC_model = lm(formula = log(price) ~ PID + area + MS.Zoning + Lot.Area +
# Neighborhood + Condition.1 + Bldg.Type + Overall.Qual + Overall.Cond +
# Year.Built + Year.Remod.Add + BsmtFin.SF.1 + BsmtFin.SF.2 +
# Bsmt.Unf.SF + Central.Air + Bsmt.Full.Bath + Kitchen.Qual +
# Functional + Garage.Cars + Garage.Area + Open.Porch.SF +
# Enclosed.Porch + Screen.Porch + has.fireplace, data = train)
#summary(BIC_model)
#Use Boosting to find important variables
#
#library(gbm)
#
#boost = gbm(log(price) ~ PID + area + MS.Zoning + Lot.Area +
# Neighborhood + Condition.1 + Bldg.Type + Overall.Qual + Overall.Cond +
# Year.Built + Year.Remod.Add + BsmtFin.SF.1 + BsmtFin.SF.2 +
# Bsmt.Unf.SF + Central.Air + Bsmt.Full.Bath + Kitchen.Qual +
# Functional + Garage.Cars + Garage.Area + Open.Porch.SF +
# Enclosed.Porch + Screen.Porch + has.fireplace, data = train, distribution="gaussian", n.trees=5000,
# interaction.depth = 1, shrinkage=0.01, verbose = F)
#summary(boost)
##Sparse Model (removed variables to bring the number below 20)
#sparse_model = lm(log(price) ~ area + MS.Zoning + Lot.Area +
# Neighborhood + Condition.1 + Overall.Qual + Overall.Cond +
# Year.Built + Year.Remod.Add + BsmtFin.SF.1 + BsmtFin.SF.2 +
# Bsmt.Unf.SF + Central.Air + Bsmt.Full.Bath + Kitchen.Qual +
# Functional + Garage.Cars + Garage.Area + has.fireplace, data = train)
#summary(sparse_model)
#plot(sparse_model)
#termplot(sparse_model,
# data = train)
##Trying to detect non-linearity
#attach(train)
#plot(log(area), log(price))
#plot(log(Lot.Area), log(price))
#plot(log(Overall.Qual), log(price))
#plot(log(Overall.Cond), log(price))
#plot(log(Year.Built), log(price))
#plot(log(Year.Remod.Add), log(price))
#plot(BsmtFin.SF.1, log(price))
#plot(BsmtFin.SF.2, log(price))
#plot(Bsmt.Unf.SF, log(price))
#plot(Bsmt.Full.Bath, log(price))
#plot(Garage.Cars, log(price))
#plot(Garage.Area, log(price))
#
##OLD Simple model
#model1 = lm(log(price) ~ log(area) + MS.Zoning + log(Lot.Area) +
# Neighborhood + Condition.1 + log(Overall.Qual) + log(Overall.Cond) +
# log(Year.Built) + log(Year.Remod.Add) + BsmtFin.SF.1 + BsmtFin.SF.2 +
# Bsmt.Unf.SF + Central.Air + Bsmt.Full.Bath +
# Functional + Garage.Cars + Garage.Area + has.fireplace, data = train[-c(168, 461, 787),])
#summary(model1)
#plot(model1)
#Simple model (based on BIC selection)
model1 = lm(formula = log(price) ~ log(Lot.Area) + Neighborhood + Condition.1 +
Overall.Qual + Overall.Cond + HouseAge + Foundation + Total.Bsmt.SF +
Central.Air + log(X12.SF) + Baths + Kitchen.AbvGr + Kitchen.Qual +
Functional + Fireplaces + Garage.Cars + Paved.Drive + Bsmt.YN:Bsmt.Exposure,
data = train)
#Plot Price Ranges of each neighborhood
library(ggplot2)
neighborhood.price.range = ggplot(train, aes(x = reorder(Neighborhood, -log(price), median), y = log(price))) +
geom_boxplot(color="SlateGray") + theme_light() + theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + labs(x="Neighborhood")
plot(neighborhood.price.range)
Create predicted values for price from your model using the testing data.
load("ames_test.Rdata")
#Variables with NA's and their proportion of missing data
miss = apply(is.na(ames_train), 2, sum)
miss_prop = round(miss[miss>0]/nrow(ames_train), 3)
print(miss_prop)
which(miss_prop>0.5) # four features have greater than 50% of data missing
#Created binary variables for whether or not a house has an alley, pool, fence, misc. feature, fireplace, basement or garage, as I thought it may be more meaningful - Tom
#May consider adding Lot.Frontage back in, as it only had 286 missing
#Have to decide between filtering out N/A's for different Garage variables and only including 'has.garage'
#Have to decide between filtering out N/A's for different Basement variables and only including 'has.basement'
test <- ames_test %>% filter(!is.na(Lot.Frontage)) %>%
mutate(Alley = factor(Alley, levels = levels(addNA(Alley)), labels = c(levels(Alley), "None"), exclude = NULL)) %>%
mutate(HouseAge = Yr.Sold- pmax(Year.Built, Year.Remod.Add)) %>%
filter(!is.na(Mas.Vnr.Area)) %>%
mutate(Bsmt.YN = !is.na(Bsmt.Qual)) %>%
mutate(Bsmt.Qual = factor(Bsmt.Qual, levels = levels(addNA(Bsmt.Qual)), labels = c(levels(Bsmt.Qual), "NoBa"), exclude = NULL)) %>%
mutate(Bsmt.Cond = factor(Bsmt.Cond, levels = levels(addNA(Bsmt.Cond)), labels = c(levels(Bsmt.Cond), "NoBa"), exclude = NULL)) %>%
mutate(Bsmt.Exposure = factor(Bsmt.Exposure, levels = levels(addNA(Bsmt.Exposure)), labels = c(levels(Bsmt.Exposure), "NoBa"), exclude = NULL)) %>%
mutate(BsmtFin.Type.1= factor(BsmtFin.Type.1, levels = levels(addNA(BsmtFin.Type.1)), labels = c(levels(BsmtFin.Type.1), "NoBa"), exclude = NULL)) %>%
mutate(BsmtFin.Type.2= factor(BsmtFin.Type.2, levels = levels(addNA(BsmtFin.Type.2)), labels = c(levels(BsmtFin.Type.2), "NoBa"), exclude = NULL)) %>%
mutate(X12.SF= X1st.Flr.SF+ X2nd.Flr.SF) %>%
filter(!is.na(Bsmt.Full.Bath)) %>%
filter(!is.na(Bsmt.Half.Bath)) %>%
mutate(Baths = Bsmt.Full.Bath + 0.5*Bsmt.Half.Bath + Full.Bath + 0.5*Half.Bath) %>%
mutate(Fireplace.YN = Fireplaces>0) %>%
mutate(Fireplace.Qu = factor(Fireplace.Qu, levels = levels(addNA(Fireplace.Qu)), labels = c(levels(Fireplace.Qu), "None"), exclude = NULL)) %>%
mutate(Garage.YN = !is.na(Garage.Cond)) %>%
mutate(Garage.Type = factor(Garage.Type, levels = levels(addNA(Garage.Type)), labels = c(levels(Garage.Type), "None"), exclude = NULL)) %>%
mutate(Garage.Finish = factor(Garage.Finish, levels = levels(addNA(Garage.Finish)), labels = c(levels(Garage.Finish), "None"), exclude = NULL)) %>%
mutate(Garage.Qual = factor(Garage.Qual, levels = levels(addNA(Garage.Qual)), labels = c(levels(Garage.Qual), "None"), exclude = NULL)) %>%
mutate(Garage.Cond = factor(Garage.Cond, levels = levels(addNA(Garage.Cond)), labels = c(levels(Garage.Cond), "None"), exclude = NULL)) %>%
mutate(Porch.Area = Wood.Deck.SF+ Open.Porch.SF+Enclosed.Porch+X3Ssn.Porch + Screen.Porch) %>%
mutate(Pool.YN = Pool.Area>0) %>%
mutate(Pool.QC = factor(Pool.QC, levels = levels(addNA(Pool.QC)), labels = c(levels(Pool.QC), "None"), exclude = NULL)) %>%
mutate(Fence = factor(Fence, levels = levels(addNA(Fence)), labels = c(levels(Fence), "None"), exclude = NULL)) %>%
mutate(Misc.Feature = factor(Misc.Feature, levels = levels(addNA(Misc.Feature)), labels = c(levels(Misc.Feature), "None"), exclude = NULL)) %>%
mutate(Mo.Sold = as.factor(Mo.Sold)) %>%
mutate(Yr.Sold = as.factor(Yr.Sold)) %>%
dplyr::select(-Garage.Yr.Blt) %>% filter(Condition.1 != "RRNe") %>% filter(Kitchen.Qual != "Po")
You should save your predictions in a dataframe with columns for PID (property identifier), fit, predicted values on the test data, and where possible lwr and upr, lower and upper 95% interval estimates for predicting price.
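Y = test$price
#Predictions on the log scale with 95% prediction intervals (assumes model1 fitted above)
Yhat = predict(model1, newdata = test, interval = "prediction")
#Back-transform fit, lwr and upr to the dollar scale
Yhat = exp(Yhat)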
#Bias
mean(Yhat[,1] - Y)
#Maximum Deviation
max(abs(Y - Yhat[,1]))
#Mean Absolute Deviation
mean(abs(Y - Yhat[,1]))
#RMSE
sqrt(mean((Y - Yhat[,1])^2))
#Coverage
mean(Yhat[,"lwr"] < Y & Yhat[,"upr"] > Y)
# name dataframe as predictions! DO NOT CHANGE
predictions = as.data.frame(Yhat)
predictions$PID = test$PID
save(predictions, file="predict.Rdata")
Your models will be evaluated on the following criteria on the test data:
* Bias: Average (Yhat-Y) positive values indicate the model tends to overestimate price (on average) while negative values indicate the model tends to underestimate price.
* Maximum Deviation: Max |Y-Yhat| - identifies the worst prediction made in the validation data set.
* Mean Absolute Deviation: Average |Y-Yhat| - the average error (regardless of sign).
* Root Mean Square Error: Sqrt Average (Y-Yhat)^2
* Coverage: Average( lwr < Y < upr)
In order to have a passing wercker badge, your file for predictions needs to be the same length as the test data, with three columns (fitted values, lower CI and upper CI values, in that order) named fit, lwr, and upr respectively.
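The layout described above can be checked before saving the file. The following is a minimal sketch, not part of the assignment code: the helper `check_predictions` and the toy numbers are hypothetical, and the interval ordering check is an extra sanity test rather than a stated requirement.

```r
# Hypothetical helper: verifies the layout required for the predictions data frame
check_predictions <- function(predictions, n_test) {
  required <- c("fit", "lwr", "upr")
  stopifnot(
    is.data.frame(predictions),
    nrow(predictions) == n_test,                    # one row per test house
    identical(names(predictions)[1:3], required),   # fit, lwr, upr in that order
    all(predictions$lwr <= predictions$fit),        # extra sanity check (assumption)
    all(predictions$fit <= predictions$upr)
  )
  invisible(TRUE)
}

# Toy example with made-up numbers, for illustration only
toy <- data.frame(fit = c(150000, 210000),
                  lwr = c(120000, 170000),
                  upr = c(185000, 260000),
                  PID = c(1L, 2L))
check_predictions(toy, n_test = 2)
```

On the real data one would call something like `check_predictions(predictions, nrow(test))` just before `save()`.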
You will be able to see your scores on the score board (coming soon!). They will be initialized by a prediction based on the mean in the training data.
Model Check - Test your prediction on the first observation in the training and test data set to make sure that the model gives a reasonable answer and include this in a supplement of your report. This should be done BY HAND using a calculator (this means use the raw data from the original dataset and manually calculate all transformations and interactions with your calculator)! Models that do not give reasonable answers will be given a minimum 2 letter grade reduction. Also be careful as you cannot use certain transformations [log or inverse x] if a variable has values of 0.
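To illustrate what the hand calculation amounts to, here is a small sketch on made-up data. The toy data frame and its values are assumptions for illustration; only the column names mimic the Ames variables. The idea is that a prediction from a log-price model is the intercept plus the sum of each coefficient times its (transformed) predictor value, exponentiated to return to dollars.

```r
# Toy data with made-up values; column names mimic the Ames variables
toy <- data.frame(
  price        = c(120000, 150000, 185000, 210000, 260000, 320000),
  Lot.Area     = c(7000, 8400, 9600, 11000, 12500, 15000),
  Overall.Qual = c(4, 5, 6, 6, 7, 8),
  Fireplaces   = c(0, 0, 1, 1, 1, 2)
)
fit <- lm(log(price) ~ log(Lot.Area) + Overall.Qual + Fireplaces, data = toy)

# "By hand" for the first observation: intercept + sum(coefficient * value)
b  <- coef(fit)
x1 <- toy[1, ]
log_price_hand <- b["(Intercept)"] +
  b["log(Lot.Area)"] * log(x1$Lot.Area) +   # apply the same transformation as the model
  b["Overall.Qual"]  * x1$Overall.Qual +
  b["Fireplaces"]    * x1$Fireplaces
price_hand <- exp(log_price_hand)           # back-transform to dollars

# The hand calculation should agree with predict()
price_pred <- exp(predict(fit, newdata = x1))
all.equal(unname(price_hand), unname(price_pred))
```

Note that `Fireplaces` enters untransformed here because it contains zeros; a term such as `log(1 + x)` is one way to transform a variable that can be 0.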
In this part you may go all out for constructing a best fitting model for predicting housing prices using methods that we have covered this semester. You should feel free to create any new variables (such as quadratic, interaction, or indicator variables, splines, etc.). The variable TotalSq = X1st.Flr.SF+X2nd.Flr.SF was added to the dataframe (it does not include basement area, so you may improve on this). A relative grade is assigned by comparing your fit on the test set to that of your fellow students, with bonus points awarded to those who substantially exceed their fellow students and point reductions for models which fit exceedingly poorly.
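As one possible illustration of the kinds of derived variables mentioned above, the sketch below adds basement area to TotalSq and includes a quadratic term. The data are made up, and the name `TotalSq.Bsmt` and the choice of terms are assumptions for illustration, not a recommended final model.

```r
# Toy data with made-up values; column names mimic the Ames variables
toy <- data.frame(
  price         = c(105000, 128000, 139500, 155000, 172000,
                    189000, 215000, 242000, 287000, 335000),
  X1st.Flr.SF   = c(860, 920, 1000, 1050, 1120, 1200, 1300, 1380, 1500, 1700),
  X2nd.Flr.SF   = c(0, 0, 400, 0, 520, 600, 0, 750, 800, 900),
  Total.Bsmt.SF = c(0, 860, 900, 1000, 0, 1100, 1300, 1200, 1450, 1650),
  HouseAge      = c(58, 45, 40, 33, 28, 20, 15, 9, 4, 1),
  Central.Air   = factor(c("N", "N", "Y", "Y", "N", "Y", "Y", "Y", "Y", "Y"))
)

# TotalSq as defined in the assignment, then a version that adds basement area
toy$TotalSq      <- toy$X1st.Flr.SF + toy$X2nd.Flr.SF
toy$TotalSq.Bsmt <- toy$TotalSq + toy$Total.Bsmt.SF   # hypothetical variable name

# Quadratic term via I(), indicator via a factor
fit2 <- lm(log(price) ~ log(TotalSq.Bsmt) + HouseAge + I(HouseAge^2) + Central.Air,
           data = toy)
coef(fit2)
```

Taking `log(TotalSq.Bsmt)` is safe here because the above-ground area is always positive, even when `Total.Bsmt.SF` is 0.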
library(glmnet)
#Design matrices
X.train = model.matrix(log(price) ~ log(Lot.Area) + Neighborhood + Condition.1 +
Overall.Qual + Overall.Cond + HouseAge + Foundation + Total.Bsmt.SF +
Central.Air + log(X12.SF) + Baths + Kitchen.AbvGr + Kitchen.Qual +
Functional + Fireplaces + Garage.Cars + Paved.Drive + Bsmt.YN:Bsmt.Exposure,
data = data)
X.test = model.matrix(log(price) ~ log(Lot.Area) + Neighborhood + Condition.1 +
Overall.Qual + Overall.Cond + HouseAge + Foundation + Total.Bsmt.SF +
Central.Air + log(X12.SF) + Baths + Kitchen.AbvGr + Kitchen.Qual +
Functional + Fireplaces + Garage.Cars + Paved.Drive + Bsmt.YN:Bsmt.Exposure,
data = test)
#Fit lasso regression
price.lasso = glmnet(X.train, log(data$price), alpha=1)
cv.lasso = cv.glmnet(X.train, log(data$price), alpha=1)
#Obtain prediction on test data
yhat.lasso.test = predict(price.lasso, s=cv.lasso$lambda.min, type="response", newx = X.test)
#Compute RMSE for test data
rmse.lasso.test = rmse(test$price, exp(yhat.lasso.test))
rmse.tab$lasso<-rmse.lasso.test
Update your predictions using your complex model to provide point estimates and CI.
You may iterate here as much as you like exploring different models until you are satisfied with your results.
Once you are satisfied with your model, provide a write up of your data analysis project in a new Rmd file/pdf file: writeup.Rmd by copying over salient parts of your R notebook. The written assignment consists of five parts:
Exploratory data analysis (20 points): must include three correctly labeled graphs and an explanation that highlight the most important features that went into your model building.
Development and assessment of an initial model from Part I (10 points)
Initial model: must include a summary table and an explanation/discussion for variable selection. Interpretation of coefficients desirable for full points.
Model selection: must include a discussion
Residual: must include a residual plot and a discussion
RMSE: must include an RMSE and an explanation (other criteria desirable)
Model testing: must include an explanation
Final model: must include a summary table
Variables: must include an explanation
Variable selection/shrinkage: must use appropriate method and include an explanation
Residual: must include a residual plot and a discussion
RMSE: must include an RMSE and an explanation (other criteria desirable)
Model evaluation: must include an evaluation discussion
Model testing : must include a discussion
Model result: must include a selection of the top 10 undervalued and overvalued houses
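One way to pick out undervalued and overvalued houses is to rank them by how far the sale price falls from the model's prediction. The sketch below uses made-up numbers and assumes a data frame holding `PID`, the observed `price`, and a back-transformed prediction `fit`; the columns `resid` and `pct` are hypothetical names. Ranking by relative rather than absolute difference is a design choice, not a requirement of the assignment.

```r
# Toy stand-in for the observed prices and dollar-scale predictions (made-up numbers)
toy <- data.frame(
  PID   = 1:6,
  price = c(150000, 200000,  95000, 310000, 180000, 240000),
  fit   = c(165000, 190000, 120000, 300000, 181000, 215000)
)

# Negative residual: sold for less than predicted (candidate undervalued house)
# Positive residual: sold for more than predicted (candidate overvalued house)
toy$resid <- toy$price - toy$fit
toy$pct   <- toy$resid / toy$fit          # relative difference

n_top <- 3                                # use 10 on the real data
undervalued <- head(toy[order(toy$pct), ], n_top)
overvalued  <- head(toy[order(-toy$pct), ], n_top)

undervalued$PID
overvalued$PID
```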
Create predictions for the validation data from your final model and write them out to a file prediction-validation.Rdata. This should have the same format as the models in Part I and II.
10 points
Each Group should prepare 5 slides in their Github repo: (save as slides.pdf)
Most interesting graphic (a picture is worth a thousand words prize!)
Best Model (motivation, how you found it, why you think it is best)
Best Insights into predicting Sales Price.
2 Best Houses to purchase (and why)
Best Team Name/Graphic
We will select winners based on the above criteria and overall performance.
Finally, your repo should have: writeup.Rmd, writeup.pdf, slides.Rmd (and whatever output you use for the presentation), predict.Rdata and predict-validation.Rdata.